Since its introduction by Akaike in the early seventies, the celebrated Akaike’s Information Criterion (AIC) has been an essential tool for the statistician, and its use is almost systematic in problems of model selection and estimator selection for prediction. By choosing among estimators or models constructed from finite degrees of freedom, the AIC recommends more specifically to maximize the log-likelihood of the estimators penalized by their corresponding degrees of freedom. This procedure has found pathbreaking applications in density estimation, regression, time series or neural network analysis, to name a few. Because of its simplicity and negligible computation cost (whenever the estimators are given), it is also far from outdated and continues to serve as one of the most useful devices for model selection in high-dimensional statistics. For instance, it can be used to efficiently tune the Lasso.
Any substantial and principled improvement of AIC is likely to have a significant impact on the practice of model choices and we bring in this paper an efficient and theoretically grounded solution to the problem of overfitting that can occur when using AIC on small to medium sample sizes.
Previous works have proposed the so-called AICc (for AIC corrected), which tends to penalize more than AIC. However, the derivation of AICc comes from an asymptotic analysis where the dimensions of the models are considered fixed relative to the sample size. In fact, such an assumption does not fit the usual practice of model selection, where the largest models are of dimensions close to the sample size. Another drawback of AICc is that it has been justified through a mathematical analysis only in the linear regression model and for autoregressive models. To the best of our knowledge, outside these frameworks, there is no theoretical ground for the use of AICc.
Building on considerations from the general nonasymptotic theory of model selection developed during the nineties, and in particular on Castellan’s analysis, Birgé and Rozenholc have considered an AIC modification specifically designed for the selection of the bin size in histogram selection for density estimation. Indeed, Castellan’s results, and more generally the results of this nonasymptotic theory, advocate taking into account, in the design of the penalty, the number of candidate models. The importance of the cardinality of the collection of models for model selection is in fact a very general phenomenon and one of the main outcomes of the nonasymptotic model selection theory. In the bin size selection problem, this corresponds to adding a small amount to AIC. Unfortunately, the theory does not specify uniquely the term to be added to AIC. On the contrary, infinitely many corrections are accepted by the theory and, in order to choose a good one, intensive experiments were conducted by Birgé and Rozenholc. The resulting AIC correction therefore has the disadvantage of being specifically designed for the task on which it has been tested.
We propose a general approach that goes beyond the limits of the unbiased risk estimation principle. The latter principle is indeed at the core of Akaike’s model selection procedure and is more generally the main model selection principle, which underlies procedures such as Stein’s Unbiased Risk Estimator (SURE) or cross-validation. We point out that it is more efficient to estimate a quantile of the risk of the estimators, with the level of the quantile depending on the size of the collection of models, than its mean. We also develop a (pseudo-)testing point of view, where we find that unbiased risk estimation does not in general allow one to control the sizes of the considered tests. This is thus a new, very general model selection principle that we put forward and formalize. We call it an over-penalization procedure, because it systematically involves adding small terms to traditional penalties such as AIC.
Our procedure consists in constructing an estimator from multiple pseudo-tests, built on some random events. From this perspective, our work shares strong connections with recent advances in robust estimation by designing estimators from tests ([9, 20, 21, 10, 39]). However, our focus and perspectives are significantly different from this existing line of work. Indeed, estimators such as T-estimators or ρ-estimators are quasi-universal estimators in the sense that they have very strong statistical guarantees, but they have the drawback of being very difficult, if at all feasible, to compute. In particular, estimators built from frequency histograms have also been proposed in this line of work, but to our knowledge no implementation of such estimators exists, and it seems to be an open question whether a polynomial time algorithm can effectively compute them or not. Here, we rather keep the tractability of the AIC procedure, but we do not look particularly at robustness properties (for instance against outliers). We focus on improving AIC in the nonasymptotic regime.
In addition, it is worth noting that several authors have examined some links between multiple testing and model selection, in particular by making some modifications to classical criteria (see for instance [34, Chapter 7]). But these lines of research differ significantly from our approach. Indeed, the first and main use in the literature of the multiple testing point of view for model selection concerns variable selection, i.e. the identification of models, particularly in the context of linear regression ([13, 27, 49, 56, 1, 16]). It consists in considering simultaneously the testing of each variable being equal to zero or not. Instead, we consider model selection from a predictive perspective and do not focus on model identification. It should also be noted that multiple tests may be considered after model selection or at the same time as selective inference ([47, 37, 17]), but these questions are not directly related to the scope of our paper.
Let us now detail our contributions.
We propose a general formulation of the model selection task for prediction in terms of a (pseudo-)test procedure (understood in a nonclassical way which will be detailed in Section 2.3.2), thus establishing a link between two major topics of contemporary research. In particular, we propose a generic property that the collection of pseudo-tests should satisfy in order to ensure an oracle inequality for the selected model. We call this property the “transitivity property” and show that it generalizes penalization procedures together with T-estimators and ρ-estimators.
Considering the problem of density estimation by selecting a histogram, we prove a sharp and fully nonasymptotic oracle inequality for our procedure. Indeed, we describe a control of the Kullback-Leibler (KL) divergence (also called excess risk) of the selected histogram as soon as we have one observation. We emphasize that this very strong feature may not be possible when considering AIC. We also stress that, to our knowledge, our oracle inequality is the first nonasymptotic result comparing the KL divergence of the selected model to the KL divergence of the oracle in an unbounded setting. Indeed, oracle inequalities in density estimation are generally expressed in terms of the Hellinger distance for the selected model, which is much easier to handle than the KL divergence because it is bounded.
In order to prove our oracle inequality, we improve upon the previously best known concentration inequality for the chi-square statistics (Castellan, Massart) and this allows us to gain an order of magnitude in the control of the deviations of the excess risks of the estimators. Our result on the chi-square statistics is general and of independent interest.
We also prove new Bernstein-type concentration inequalities for log-densities that are unbounded. Again, these probabilistic results, which are naturally linked to information theory, are general and of independent interest.
Finally, from a practical point of view, we bring a nonasymptotic improvement of AIC that has, in its simplest form, the same computational cost as AIC. Furthermore, we show that our over-penalization procedure largely outperforms AIC on small and medium sample sizes, but also surpasses existing AIC corrections such as AICc or Birgé-Rozenholc’s procedure.
Let us end this introduction by detailing the organization of the paper.
We present our over-penalization procedure in Section 2. More precisely, we detail in Sections 2.1 and 2.2 our model selection framework related to MLE via histograms. Then in Section 2.3 we define formally over-penalization procedures and highlight their generality. We explain the ideas underlying over-penalization from three different angles: estimation of the ideal penalty, pseudo-testing and a graphical point of view.
Section 3 is devoted to statistical guarantees related to over-penalization. In particular, as concentration properties of the excess risks are at the heart of the design of an over-penalization, we detail them in Section 3.1. We then deduce a general and sharp oracle inequality in Section 3.2 and highlight the theoretical advantages compared to an AIC analysis.
New mathematical tools of a probabilistic and analytical nature and of independent interest are presented in Section 4. Section 5 contains the experiments, with detailed practical procedures. We consider two different practical variations of over-penalization and compare them with existing penalization procedures. The superiority of our method is particularly transparent.
2 Statistical Framework and Notations
The over-penalization procedure, described in Section 2.3, is legitimated at a heuristic level within a generic M-estimator selection framework. We put the emphasis in Section 2.1 on maximum likelihood estimation (MLE) since, as a proof of concept for our over-penalization procedure, our theoretical and experimental results will address the case of bin size selection for maximum likelihood histogram selection in density estimation. In order to be able to discuss in Section 2.3 the generality of our approach in an M-estimation setting, our presentation of MLE introduces notations which extend directly to M-estimation with a general contrast.
2.1 Maximum Likelihood Density Estimation
We are given independent observations with unknown common distribution on a measurable space. We assume that there exists a known probability measure on such that admits a density with respect to : . Our goal is to estimate the density .
For an integrable function on , we set and . If denotes the empirical distribution associated to the sample , then we set . Moreover, taking the conventions , and defining the positive part as , we set
We assume that the unknown density belongs to .
Note that since , the fact that belongs to is equivalent to , the space of integrable functions on with respect to .
We consider the MLE of the density . To do so, we define the maximum likelihood contrast to be the following functional,
Then the risk associated to the contrast on a function is the following,
Also, the excess risk of a function with respect to the density is classically given in this context by the KL divergence of with respect to . Recall that for two probability distributions and on of respective densities and with respect to , the KL divergence of with respect to is defined to be
By a slight abuse of notation we denote rather than and by the Jensen inequality we notice that is a nonnegative quantity, equal to zero if and only if the two distributions coincide. Hence, for any , the excess risk of a function with respect to the density satisfies
and this nonnegative quantity is equal to zero if and only if the function coincides with the density almost everywhere with respect to the reference measure. Consequently, the unknown density is uniquely defined by
For a model , that is a subset , we define the maximum likelihood estimator on , whenever it exists, by
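For reference, the standard objects of this subsection can be written as in the sketch below. The symbols are assumed notation chosen for illustration, since the displayed formulas are not reproduced in the text: $f_*$ for the unknown density, $\mu$ for the reference measure, $P$ and $P_n$ for the true and empirical distributions, $\xi_1,\dots,\xi_n$ for the sample, $\gamma$ for the contrast and $m$ for a model.

```latex
% Assumed notation (illustrative, not taken from the text):
% f_* unknown density, \mu reference measure, P law of the data,
% P_n empirical measure of the sample \xi_1,\dots,\xi_n, m a model.
\begin{align*}
\gamma(f) &= -\ln f, \\
P\gamma(f) - P\gamma(f_*) &= \int f_* \ln\!\Big(\frac{f_*}{f}\Big)\, d\mu
  = \mathcal{K}(f_*, f) \;\ge\; 0, \\
\hat f_m &\in \arg\min_{f \in m} P_n\gamma(f)
  = \arg\min_{f \in m} \Big\{ -\frac{1}{n}\sum_{i=1}^{n} \ln f(\xi_i) \Big\}.
\end{align*}
```

The second line is the usual identification of the excess risk of the log-likelihood contrast with the KL divergence, and the third line is the maximum likelihood estimator on a model written as an empirical risk minimizer.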
2.2 Histogram Models
The models that we consider here to define the maximum likelihood estimators as in (2) are made of histograms defined on a fixed partition of . More precisely, for a finite partition of of cardinality , , we set
Note that the smallest affine space containing is of dimension . The quantity can thus be interpreted as the number of degrees of freedom in the (parametric) model. We assume that any element of the partition is of positive measure with respect to : for all , . As the partition is finite, we have for all and so . We state in the next proposition some well-known properties that are satisfied by histogram models subjected to the procedure of MLE (see for example [43, Section 7.3]).
Then and is called the KL projection of onto . Moreover, it holds
The following Pythagorean-like identity for the KL divergence holds, for every ,
The maximum likelihood estimator on is well-defined and corresponds to the so-called frequencies histogram associated to the partition . We also have the following formulas,
Histogram models are special cases of general exponential families presented for example in Barron and Sheu (see also Castellan for the case of exponential models of piecewise polynomials). The projection property (3) can be generalized to exponential models (see [12, Lemma 3] and Csiszár).
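As an illustration of the frequency histogram described above, here is a minimal sketch in Python. It assumes data in [0, 1], the Lebesgue measure as reference measure and a regular partition; the function names `frequency_histogram` and `mean_log_likelihood` are hypothetical helpers introduced for this example only.

```python
import math
import random


def frequency_histogram(sample, n_bins):
    """Heights of the frequency histogram on a regular partition of [0, 1].

    On each bin, the height is (empirical frequency of the bin) / (bin width).
    """
    n = len(sample)
    counts = [0] * n_bins
    for x in sample:
        # min(...) sends the boundary point x == 1.0 to the last bin.
        counts[min(int(x * n_bins), n_bins - 1)] += 1
    width = 1.0 / n_bins
    return [c / (n * width) for c in counts]


def mean_log_likelihood(heights, sample):
    """Average log-likelihood of a piecewise-constant density on the sample."""
    n_bins = len(heights)
    total = 0.0
    for x in sample:
        h = heights[min(int(x * n_bins), n_bins - 1)]
        if h == 0.0:
            return -math.inf
        total += math.log(h)
    return total / len(sample)


rng = random.Random(0)
sample = [rng.betavariate(2.0, 5.0) for _ in range(500)]  # toy data in [0, 1]
heights = frequency_histogram(sample, n_bins=8)
uniform_heights = [1.0] * 8  # another density of the same histogram model
```

On this toy sample, the frequency histogram integrates to one and its log-likelihood is at least that of any other density of the same histogram model, here the uniform one, which is what being the maximum likelihood estimator on the model means.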
2.3 Over-Penalization
Now let us define our model selection procedure. We propose three ways to understand the benefits of over-penalization. Of course, the three points of view are interrelated, but they provide different and complementary insights on the behavior of over-penalization.
2.3.1 Over-Penalization as Estimation of the Ideal Penalty
We are given a collection of histogram models denoted , with finite cardinality depending on the sample size , and its associated collection of maximum likelihood estimators . By taking a (nonnegative) penalty function on ,
the output of the penalization procedure (also called the selected model) is by definition any model satisfying,
We aim at selecting an estimator with a KL divergence, pointed on the true density , as small as possible. Hence, we want our selected model to have a performance as close as possible to the excess risk achieved by an oracle model (possibly non-unique), defined to be,
since in this case, the criterion is equal to the true risk . However is unknown and, at some point, we need to give some estimate of it. In addition, is random, but we may not be able to provide a penalty, even random, whose fluctuations at a fixed model would be positively correlated to the fluctuations of . This means that we are rather searching for an estimate of a deterministic functional of . But which functional would be convenient? The answer to this question is essentially contained in the solution of the following problem.
Problem 1. For any fixed find the deterministic penalty , that minimizes the value of , among constants which satisfy the following oracle inequality,
The solution—or even the existence of a solution—to the problem given in (7) is not easily accessible and depends on assumptions on the law of data and on approximation properties of the models, among other things. In the following, we give a reasonable candidate for . Indeed, let us set Card and define
where is the quantile of level for the real random variable. Our claim is that gives in (7) a constant which is close to one, under some general assumptions (see Section 3 for precise results). Let us explain now why should lead to a nearly optimal model selection.
We see, by definition of and by a simple union bound over the models , that the event is of probability at least . Now, by definition of , we have, for any ,
By centering by and using simple algebra, Inequality (9) can be written as,
Now, on , we have , so we get on ,
Specializing to the MLE context, the latter inequality reads,
In order to get an oracle inequality as in (7), it remains to control and in terms of the excess risks and . Quantity is related to deviation bounds for the true and empirical excess risks of the M-estimators and quantity is related to fluctuations of the empirical bias around the bias of the models. Suitable controls of these quantities (as achieved in our proofs) will give sharp oracle inequalities (see Section 3 below).
Notice that our reasoning is not based on the particular value of the contrast, so that to emphasize this point we choose to keep in most of our calculations rather than to specify to the KL divergence related to the MLE case. As a matter of fact, the penalty given in (8) is a good candidate in the general context of M-estimation.
We define an over-penalization procedure as follows.
A penalization procedure as defined in (4) is said to be an over-penalization procedure if the penalty that is used satisfies for all and for some .
Based on concentration inequalities for the excess risks (see Section 3.1) we propose the following over-penalization penalty for histogram selection,
where is a constant that should be either fixed a priori ( or are typical choices) or estimated using data (see Section 5 for details about the choice of ). The logarithmic terms appearing in (10) are linked to the cardinality of the collection of models, since in our proofs we take a constant such that . The constant then enters in the constant of (10). We show below the nonasymptotic accuracy of such a procedure, both theoretically and practically.
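To make the penalization scheme concrete, the following Python sketch selects the number of bins of a regular histogram on [0, 1] by minimizing a penalized empirical risk, once with an AIC-type penalty and once with an over-penalized variant. The function `over_penalty` is a hypothetical stand-in that adds a small logarithmic term to the AIC-type penalty; it is not formula (10), which is not reproduced here, and the constant `c`, the regular partitions and the toy data are assumptions of this example.

```python
import math
import random


def empirical_risk(sample, n_bins):
    """Minus the mean log-likelihood of the frequency histogram with n_bins bins."""
    n = len(sample)
    counts = [0] * n_bins
    for x in sample:
        counts[min(int(x * n_bins), n_bins - 1)] += 1
    # Convention 0 * ln 0 = 0: empty bins contribute nothing.
    return -sum(c / n * math.log(n_bins * c / n) for c in counts if c > 0)


def aic_penalty(n_bins, n):
    """AIC-type penalty: number of bins over sample size."""
    return n_bins / n


def over_penalty(n_bins, n, c=1.0):
    """Hypothetical over-penalty: AIC-type penalty inflated by a log term.

    Illustrative only; this is NOT the penalty (10) of the text.
    """
    return (1.0 + c * math.sqrt(math.log(n + 1) / n_bins)) * n_bins / n


def select_n_bins(sample, max_bins, penalty):
    """Number of bins minimizing empirical risk + penalty."""
    n = len(sample)
    return min(
        range(1, max_bins + 1),
        key=lambda d: empirical_risk(sample, d) + penalty(d, n),
    )


rng = random.Random(1)
sample = [rng.betavariate(2.0, 5.0) for _ in range(60)]  # small toy sample
d_aic = select_n_bins(sample, 30, aic_penalty)
d_over = select_n_bins(sample, 30, over_penalty)
```

Because the extra term added by `over_penalty` increases with the number of bins, the over-penalized choice can never use more bins than the AIC-type choice on the same data, which is the sense in which over-penalization guards against overfitting in this sketch.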
2.3.2 Over-Penalization through a pseudo-testing approach
This task can be formulated by solving iterative pseudo-tests. Indeed, set the following collection of null and alternative hypotheses indexed by pairs of models: for ,
where the constant is uniform in . To each pair , let us assume that we are given a test that is equal to one if is rejected and zero otherwise.
It should be noted at this stage that what we have just called “pseudo-test” does not enter directly into the classical theory of statistical tests, since the null and alternative hypotheses that we consider are random events. However, as we will see, the only notion related to our pseudo-tests needed for our model selection study is the notion of the “size” of a pseudo-test, which we will give in the following and which will provide a mathematically consistent analysis of the statistical situation. Moreover, it seems that random hypotheses naturally arise in some statistical frameworks. For instance, in the context of variable selection along the Lasso path, Lockhart et al. consider sequential random null hypotheses based on the active sets of the variables included up to a step on the Lasso path (see especially [41, Section 2.6] as well as the discussion of Bühlmann et al. [19, Section 2] and the rejoinder [40, Section 2]).
Finally, if the testing of random hypotheses disturbs the reader, we suggest taking our multiple pseudo-testing interpretation of the model selection task in its minimal sense, that is, as a mathematical description aiming at investigating tight conditions on the penalty, that would allow for near-optimal oracle guarantees for the penalization scheme (4).
In order to ensure an oracle inequality such as in (7), we want to avoid as far as possible selecting a model whose excess risk is far greater than the one of the oracle. In terms of the preceding tests, we will see that this exactly corresponds to controlling the “size” of the pseudo-tests .
Let us denote by the event where the pseudo-test rejects the null hypothesis and by the event where the hypothesis is true. By extension of the classical theory of statistical testing, we denote the size of the pseudo-test , given by
To explain how we select a model that is close to the oracle model , let us enumerate the models of the collection: . Note that we do not assume that the models are nested. Now, we test for increasing from to . If there exists such that , then we perform the pseudo-tests for increasing from to or choose as our oracle candidate if . Otherwise, we choose as our oracle candidate. In general, we can thus define a finite increasing sequence of models and a selected model through the use of the collection of pseudo-tests . Also, the number of pseudo-tests that are needed to define is equal to .
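The iterative procedure just described can be sketched as follows in Python. The function name `select_by_pseudo_tests`, the callable `rejects` and the toy criterion values are hypothetical and introduced for illustration only; `rejects(m, m_prime)` is assumed to return `True` when the pseudo-test for the pair rejects its null hypothesis, in which case the second model replaces the first as oracle candidate.

```python
def select_by_pseudo_tests(models, rejects):
    """Run the sequential pseudo-tests over an enumerated collection of models.

    The current candidate is tested against each later model in turn and is
    replaced as soon as a pseudo-test rejects. Returns the selected model and
    the list of pairs that were effectively tested.
    """
    candidate = models[0]
    tested_pairs = []
    for challenger in models[1:]:
        tested_pairs.append((candidate, challenger))
        if rejects(candidate, challenger):
            candidate = challenger
    return candidate, tested_pairs


# Toy example: pseudo-tests defined from a (hypothetical) penalized criterion,
# rejecting when the challenger has a strictly smaller criterion value.
criterion = {"m1": 0.42, "m2": 0.35, "m3": 0.38, "m4": 0.31, "m5": 0.33}


def rejects(m, m_prime):
    return criterion[m] > criterion[m_prime]


selected, tested_pairs = select_by_pseudo_tests(list(criterion), rejects)
```

In this sketch exactly one pseudo-test is performed per model after the first, and with criterion-based pseudo-tests the selected model is a minimizer of the criterion.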
Let us denote the set of pairs of models that have effectively been tested along the iterative procedure. We thus have . Now define the event under which there is a first kind error along the collection of pseudo-tests in ,
and assume that the sizes of the pseudo-tests are chosen such that
Assume also that the selected model has the following Transitivity Property (TP):
If for any then .
We believe that the transitivity property (TP) is intuitive and legitimate as it amounts to assuming that there is no contradiction in choosing as a candidate oracle model. Indeed, if a model is thought to be better than a model , that is , then the selected model should also be thought to be better than , .
The power of the formalism we have just introduced lies in the fact that the combination of Assumptions (11) and (TP) automatically implies that an oracle inequality as in (7) is satisfied. Indeed, property (TP) ensures that because on the event , we have , so there exists such that and (otherwise ) which in turn ensures . Consequently, , which gives
that is equivalent to (7).
Let us turn now to a specific choice of pseudo-tests corresponding to model selection by penalization. If we define for a penalty , the penalized criterion , and take the following pseudo-tests
then it holds
and property (TP) is by consequence satisfied.
It remains to choose the penalty such that the sizes of the pseudo-tests are controlled. This is achieved by taking as defined in (8), with . Indeed, in this case,
In line (13), the equality is only approximated since we neglected the centering of model biases by their empirical counterparts, as these centered random variables should be small compared to the other quantities for models of interest. Now assume that
for some deterministic sequence not depending on . Such result is obtained in Section 3.1 and is directly related to the concentration behavior of the true and empirical excess risks. Then we get
where the last equality is valid if . In this case,
There is a gap between the penalty considered in Section 2.3.1 to ensure an oracle inequality and the penalty defined in this section. This comes from the fact that using the pseudo-testing framework, we aim at controlling the probability of the event under which there is a first kind error along the pseudo-tests performed in . Despite the fact that the set consists in pseudo-tests, we give a bound that takes into account the pseudo-tests defined from the pairs of models. There is a possible loss here that consists in inflating the set in order to make the union bound valid. However, such loss would only affect the constant in our over-penalization procedure (10) by a factor , since the modification of the order of the quantile affects the penalty through a logarithmic factor.
The Transitivity Property (TP) allows one to unify most selection rules. Indeed, as soon as we want to select an estimator (or a model) that optimizes a criterion,
where is a collection of candidate functions (or models), then is also defined by the collection of tests . In particular, in T-estimation as well as in ρ-estimation, the estimator is indeed a minimizer of a criterion that is interpreted as a diameter of a subset of functions in . Note that in T-estimation, the criterion is itself constructed through the use of some (robust) tests, which consequently do not act at the same level as our tests in this case.
2.3.3 Graphical insights on over-penalization
Finally, let us provide a graphic perspective on our over-penalization procedure.
If the penalty is chosen according to the unbiased risk estimation principle, then it should satisfy, for any model ,
In other words, the curve fluctuates around its mean, which is essentially the curve , see Figure 2. Furthermore, the larger the model , the larger the fluctuations of . This is seen for instance through the concentration inequalities established in Section A.1 below for the empirical excess risk . Consequently, it can happen that the curve is rather flat for the largest models and that the selected model is among the largest of the collection, see Figure 2.
By using an over-penalization procedure instead of the unbiased risk estimation principle, we will compensate the deviations for the largest models and thus obtain a thinner region of potential selected models, see Figures 3 and 4. In other words, we will avoid overfitting and by doing so, we will ensure a reasonable performance of our over-penalization procedure in situations where unbiased risk estimation fails. This is particularly the case when the amount of data is small to moderate.
3 Theoretical Guarantees
We state here our theoretical results related to the behavior of our over-penalization procedure.
As explained in Section 2.3, concentration inequalities for true and empirical excess risks are essential tools for understanding our model selection problem. As the theme of concentration inequalities for the excess risk constitutes a very recent and exciting area of research, these inequalities also have an interest in themselves and we state them in Section 3.1.
In Section 3.2, we give a sharp oracle inequality proving the optimality of our procedure. We also compare our result to what would be obtained for AIC, which suggests the superiority of over-penalization in the nonasymptotic regime, i.e. for a small to medium sample size.
3.1 True and empirical excess risks’ concentration
In this section, we fix the linear model made of histograms and we are interested in concentration inequalities for the true excess risk on and for its empirical counterpart.
Theorem 3.1. Let , and let and be positive constants. Take a model of histograms defined on a fixed partition of . The cardinality of is denoted by . Assume that and
If , then a positive constant exists, only depending on and , such that by setting
we have, on an event of probability at least ,
In the previous theorem, we obtain sharp upper and lower bounds for true and empirical excess risk on . They are optimal at the first order since the leading constants are equal in the upper and lower bounds. They show the concentration of the true and empirical excess risks around the value . Moreover, Theorem 3.1 establishes equivalence with high probability of the true and empirical excess risks for models of reasonable dimension.
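The concentration phenomenon can be checked numerically on a toy case. The sketch below simulates the true and empirical excess risks of the frequency histogram for the uniform density on [0, 1] and a regular partition, and compares their averages with a reference value of order (number of bins) / (2 × sample size). This reference value is an assumption of the example, taken from the classical chi-square (Wilks-type) heuristic, since the explicit value of the theorem is not reproduced here; the sample size, number of bins and number of repetitions are arbitrary choices.

```python
import math
import random


def excess_risks(rng, n, n_bins):
    """True and empirical excess risks of the frequency histogram.

    Setting: uniform density on [0, 1], regular partition into n_bins bins,
    so the KL projection onto the model is the uniform density itself.
    """
    counts = [0] * n_bins
    for _ in range(n):
        counts[min(int(rng.random() * n_bins), n_bins - 1)] += 1
    p = 1.0 / n_bins                    # true probability of each bin
    freqs = [c / n for c in counts]     # empirical bin frequencies
    # Empirical excess risk: KL divergence of the true bin law w.r.t. frequencies,
    # computed under the empirical measure, i.e. KL(freqs, p).
    empirical = sum(q * math.log(q / p) for q in freqs if q > 0)
    # True excess risk: KL(p, freqs), infinite if some bin is empty.
    if any(q == 0 for q in freqs):
        true = math.inf
    else:
        true = sum(p * math.log(p / q) for q in freqs)
    return true, empirical


rng = random.Random(0)
n, n_bins, n_rep = 2000, 20, 200
results = [excess_risks(rng, n, n_bins) for _ in range(n_rep)]
mean_true = sum(r[0] for r in results) / n_rep
mean_empirical = sum(r[1] for r in results) / n_rep
reference = n_bins / (2 * n)  # assumed Wilks-type reference value
```

With these settings both averages fall close to the reference value and close to each other, which is consistent with the equivalence of true and empirical excess risks for models of reasonable dimension.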
Concentration inequalities for the excess risks as in Theorem 3.1 are a new and exciting direction of research related to the theory of statistical learning and to high-dimensional statistics. Boucheron and Massart obtained a pioneering result describing the concentration of the empirical excess risk around its mean, a property that they call a high-dimensional Wilks phenomenon. Then a few authors obtained results describing the concentration of the true excess risk around its mean or around its median for (penalized) least squares regression and in an abstract M-estimation framework. In particular, recent results include the case of MLE on exponential models and, as a matter of fact, on histograms. Nevertheless, Theorem 3.1 is a valuable addition to the literature on this line of research since we obtain here not only the concentration around a fixed point, but also an explicit value for this point. On the contrary, the concentration point is available in these earlier results only through an implicit formula involving local suprema of the underlying empirical process.
The principal assumption in Theorem 3.1 is inequality (14) of lower regularity of the partition with respect to . It is ensured as soon as the density is uniformly bounded from below and the partition is lower regular with respect to the reference measure (which will be the Lebesgue measure in our experiments). No restriction on the largest values of is needed. In particular, we do not restrict to the bounded density estimation setting.
Castellan proved related, but weaker inequalities than in Theorem 3.1 above. She also asked for a lower regularity property of the partition, as in Proposition 2.5 of her work, where she derived a sharp control of the KL divergence of the histogram estimator on a fixed model. More precisely, Castellan assumes that there exists a positive constant such that
This latter assumption is thus weaker than (14) for the considered model as its dimension is less than the order . We could assume (18) instead of (14) in order to derive Theorem 3.1. This would lead to less precise results for second order terms in the deviations of the excess risks but the first order bounds would be preserved. More precisely, if we replace assumption (14) in Theorem 3.1 by Castellan’s assumption (18), a careful look at the proofs shows that the conclusions of Theorem 3.1 are still valid for , where is some positive constant. Thus assumption (14) is not a fundamental restriction in comparison to Castellan’s work, but it leads to more precise results in terms of deviations of the true and empirical excess risks of the histogram estimator.
3.2 An Oracle Inequality
First, let us state the set of five structural assumptions required to establish the nonasymptotic optimality of the over-penalization procedure. These assumptions will be discussed in more detail at the end of this section, following the statement of a sharp oracle inequality.
Set of assumptions (SA)
Polynomial complexity of :
Upper bound on dimensions of models in : there exists a positive constant such that for every
The unknown density satisfies some moment condition and is uniformly bounded from below: there exist some constants and such that,
Lower regularity of the partition with respect to : there exists a positive finite constant such that, for all ,
The bias decreases like a power of : there exist and such that
We are now ready to state our main theorem related to optimality of over-penalization.
Take and . For some , consider the following penalty,
Assume that the set of assumptions (SA) holds and that
Then there exists an event of probability at least and some positive constant depending only on the constants defined in (SA) such that, if