In this paper we study the classical problem of the estimation of a density function $f:\mathbb{R}^d\to\mathbb{R}$ whose support is included in $[0,1]^d$. We observe $n$ independent and identically distributed random variables $X_1,\ldots,X_n$ with density $f$. In this context, an estimator is a measurable map $\hat f_n:(\mathbb{R}^d)^n\to L_p([0,1]^d)$ where $d\in\mathbb{N}^*$ is a fixed parameter. The accuracy of $\hat f_n$ is measured using the risk:
\[
R_{n,p}(\hat f_n,f)=\left(\mathbb{E}_f\big\|\hat f_n(X_1,\ldots,X_n)-f\big\|_p^p\right)^{1/p},
\]
where $p$ is also a fixed parameter greater than or equal to $1$ and $\mathbb{E}_f$ denotes the expectation with respect to the probability measure $\mathbb{P}_f$ of the observations. Moreover the $L_p$-norm of a function $g$ is defined by
\[
\|g\|_p=\left(\int_{[0,1]^d}|g(t)|^p\,dt\right)^{1/p}.
\]
We are interested in finding data-driven procedures that achieve the minimax rate of convergence over Sobolev-type functional classes of densities that map $[0,1]^d$ onto $\mathbb{R}$. The density estimation problem is widely studied and we refer the reader to Devroye and Györfi (1985) and Silverman (1986) for a broad picture of this domain of statistics. One of the most popular ways to estimate a density function is to use kernel density estimates, introduced by Rosenblatt (1956) and Parzen (1962). Given a kernel $K$ (that is, a function such that
\[
\int_{\mathbb{R}^d}K(t)\,dt=1
\]
) and a bandwidth vector $h=(h_1,\ldots,h_d)\in(0,+\infty)^d$, such an estimator writes:
\[
\hat f_h(x)=\frac{1}{n\prod_{i=1}^{d}h_i}\sum_{k=1}^{n}K\!\left(\frac{x-X_k}{h}\right),\qquad x\in[0,1]^d,
\]
where $(x-X_k)/h$ stands for the coordinate-wise division of the vectors $x-X_k$ and $h$.
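As an illustration, here is a minimal numerical sketch of such a kernel estimator, with a product Epanechnikov kernel and a coordinate-wise bandwidth vector (all names and the specific kernel choice are ours, not from the paper):

```python
import numpy as np

def kde(x, data, h):
    """Kernel density estimate at points x (m x d) from data (n x d), with
    bandwidth vector h (length d) and a product Epanechnikov kernel.
    The division below is the coordinate-wise division of the statement."""
    x, data, h = np.atleast_2d(x), np.atleast_2d(data), np.asarray(h, float)
    u = (x[:, None, :] - data[None, :, :]) / h            # coordinate-wise division
    K = np.prod(0.75 * np.clip(1.0 - u**2, 0.0, None), axis=2)
    return K.mean(axis=1) / np.prod(h)

rng = np.random.default_rng(0)
X = rng.uniform(size=(20000, 2))                          # uniform density on [0,1]^2
est = kde([[0.5, 0.5]], X, h=[0.2, 0.2])[0]               # interior point: close to 1
```

At interior points the estimate is close to the true density value; its behavior near the boundary of $[0,1]^d$ is precisely the issue discussed in Section 2.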
It is commonly accepted that bandwidth selection is the key point in estimating the density function accurately, and many popular selection procedures have been proposed in the literature. Among others, let us point out cross-validation (see Rudemo, 1982; Bowman, 1984; Chiu, 1991) as well as the procedure developed by Goldenshluger and Lepski in a series of papers in the last few years (see Goldenshluger and Lepski, 2008, 2011, 2014, for instance), which has been fruitfully applied in various contexts.
When dealing with bounded data, the so-called boundary bias problem must also be taken into account. Indeed, classical kernels suffer from a severe bias term when the underlying density function does not vanish near the boundary of its support. To overcome this drawback, several procedures have been developed: Schuster (1985), Silverman (1986) and Cline and Hart (1991) studied the reflection of the data near the boundary, while Marron and Ruppert (1994) proposed a preliminary transformation of the data. Müller (1991), Lejeune and Sarda (1992), Jones (1993), Müller and Stadtmüller (1999) and Botev et al. (2010) proposed to construct kernels that take into account the shape of the support of the density. In the same spirit, Chen (1999)
studied a new class of kernels constructed using a reparametrization of the family of Beta distributions. For these methods, practical choices of bandwidth or cross-validation selection procedures have generally been proposed. Nevertheless, few papers study the theoretical properties of bandwidth selection procedures in this context. Among others, we point out Bouezmarni and Rombouts (2010), who study the behavior of Beta kernels with a cross-validation selection procedure in a multivariate setting in the specific case of a twice differentiable density. Bertin and Klutchnikoff (2014) study a selection rule based on Lepski's method (see Lepski, 1991) in conjunction with Beta kernels in a univariate setting and prove that the associated estimator is adaptive over Hölder classes of smoothness smaller than or equal to two. In this paper, we aim at constructing estimation procedures that address both problems (boundary bias and bandwidth selection) simultaneously, with optimal adaptive properties in $L_p$ norm ($p\ge 1$) over a large scale of function classes. To tackle the boundary bias problem, we construct a family of kernel estimators based on new asymmetric kernels whose shape adapts to the position of the estimation point in $[0,1]^d$. We propose two different data-driven procedures, based on the Goldenshluger and Lepski approach, that satisfy oracle-type inequalities (see Theorems 1 and 3). The first procedure, based on a fixed kernel, consists in selecting a bandwidth vector. Theorem 2 proves that the resulting estimator is adaptive over anisotropic Sobolev-Slobodetskii classes with smoothness parameters smaller than the order of the kernel, with the optimal rate $n^{-\bar{s}/(2\bar{s}+1)}$ where $\bar{s}$ is defined by $1/\bar{s}=\sum_{i=1}^{d}1/s_i$. The second procedure jointly selects a kernel (and its order) and a univariate bandwidth. Such selection procedures have previously been used only in the context of exact asymptotics in pointwise and sup-norm risks, and for very restrictive function classes.
Theorem 4 states that this procedure is adaptive over isotropic Sobolev-Slobodetskii classes without any restriction on the smoothness parameter and achieves the optimal rate $n^{-s/(2s+d)}$. These function classes are quite large and correspond to a special case of the usual Besov classes (see Triebel, 1995). Note also that the same results can be obtained over anisotropic Hölder classes with the same rates of convergence. Such adaptive results, without restrictions on the smoothness of the function to be estimated and with the optimal rates recalled above, have been established only for ellipsoid function classes, as in Asin and Johannes (2016), among others. For bounded data, we also mention Rousseau (2010) and Autin et al. (2010), who construct adaptive estimators based on Bayesian mixtures of Betas and on wavelets respectively, but with an extra logarithmic factor in the rate of convergence. Additionally, note that Beta kernel density estimators are minimax only for small smoothness (see Bertin and Klutchnikoff, 2011) and consequently do not allow one to obtain such adaptive results.
The rest of the paper is organized as follows. In Section 2, we detail the effect of the boundary bias and we propose a new family of estimators that do not suffer from this drawback. We construct in Section 3 our two main statistical procedures. The main results of the paper are stated in Section 4 whereas their proofs are postponed to Section 5.
2 On the boundary bias problem
2.1 Weakness of convolution kernel estimators
In this section we focus on the so-called boundary bias problem that arises when classical convolution kernels are used. To illustrate our point, and for the sake of simplicity, we assume that $d=1$. In what follows we consider the estimators defined in (3):
\[
\hat f_h(x)=\frac{1}{nh}\sum_{k=1}^{n}K\!\left(\frac{x-X_k}{h}\right),\qquad x\in[0,1],
\]
where $h\in(0,1)$ is a bandwidth and the kernel $K$ is such that:
\[
K(-t)=K(t)\quad\text{for any }t\in\mathbb{R}\qquad\text{and}\qquad\int_{\mathbb{R}}K(t)\,dt=1,
\]
with $\operatorname{supp}(K)\subseteq[-1,1]$. In this context, the following lemma, which is straightforward, proves that these estimators suffer from an asymptotic pointwise bias at the endpoint $0$ as soon as $f(0)>0$.
Assume that $f$ is continuous at $0$. Then, we have $\lim_{h\to 0}\mathbb{E}_f\hat f_h(0)=f(0)/2$.
However this problem is not specific to the endpoint and extends to a whole neighborhood of this point. The simplest situation that allows one to understand this phenomenon is the estimation of the function $f_0=\mathbb{1}_{[0,1]}$ where, here and after, $\mathbb{1}_{[0,1]}$ stands for the indicator function of the interval $[0,1]$. In this case, under a more restrictive assumption on the kernel, the integrated bias can be bounded from below by a power of $h$, up to a multiplicative factor. More precisely we can state the following result:
Assume that is a kernel such that and assume that there exist and that satisfy
Then, for any we have:
As a consequence of this proposition we can state the following lower bound on the rate of convergence of the classical convolution kernel estimators over a very large family of functional classes.
Let . Let be a functional class such that . Assume that . Under the assumptions of Proposition 1, we have:
Now, let us comment on these two results. First, remark that the assumptions made on the kernel are not very restrictive, since any continuous symmetric kernel satisfying a mild nondegeneracy condition can be considered. Next, in view of Theorem 4 stated below, Proposition 2 proves that the convolution kernel estimators are not optimal. In particular, they do not achieve the minimax rate of convergence over usual Hölder classes (see Definition 2 as well as Remark 3 for more details). This result is mainly explained by Proposition 1 since, in this situation, the integrated bias term is of larger order than the expected bias term (see Proposition 3 below).
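The boundary deterioration can be observed numerically. The following sketch (our illustration, with an Epanechnikov kernel and the uniform density) shows that the estimator is consistent at an interior point while it converges to roughly half the true value at the endpoint $0$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, h = 200_000, 0.05
X = rng.uniform(size=n)                      # density f = 1 on [0,1]

def kde(x):
    u = (x - X) / h
    return np.mean(0.75 * np.clip(1.0 - u**2, 0.0, None)) / h

interior, endpoint = kde(0.5), kde(0.0)
# For a symmetric kernel, E kde(0) ~ f(0) * \int_{-1}^{0} K = f(0)/2:
# half of the kernel mass lies outside the support of f.
```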
2.2 Boundary kernel estimators
The main drawback of classical convolution kernels can be explained as follows: they look outside the support of the function to be estimated. As a consequence, $f$ is seen as a discontinuous function that maps $\mathbb{R}$ to $\mathbb{R}$. This leads to a severe bias and explains why the "boundary kernels" found in the literature have all their mass inside the support of the target function. Indeed, in this situation $f$ is seen as a function that maps $[0,1]^d$ to $\mathbb{R}$, which can be a very smooth function. This allows the bias term to be small (see Proposition 3 below).
In recent decades, several papers have proposed different constructions of kernels that take the boundary problem into account. Let us point out that, among others, Müller and Stadtmüller (1999) and Chen (1999) constructed specific kernels whose shape adapts to the localization of the estimating point in a continuous way. Even if this continuous deformation seems to be an attractive property, there are still some drawbacks to using such approaches. On the one hand, Beta kernels cannot be used to estimate smooth functions (see Bertin and Klutchnikoff, 2011). On the other hand, the kernels proposed by Müller and Stadtmüller (1999) are solutions of a continuous least squares problem for each estimating point. In practice the kernels are computed using discretizations of the variational problems, which can be computationally intensive. Moreover, to our knowledge, there are no theoretical guarantees regarding bandwidth selection procedures in this context.
In this paper, we propose a simple and tractable way to construct boundary kernels that intends to solve the aforementioned problems. The main advantage of our construction lies in the fact that the resulting estimators are easy to compute and that the mathematical analysis of the adaptive procedure is made possible even in the anisotropic case.
To construct our kernels we first define the following set of univariate bounded kernels whose support is included in $[0,1]$:
In the following, we will say that $K$ is a kernel of order $\ell\in\mathbb{N}$ if
\[
\int_{\mathbb{R}}K(t)\,dt=1\qquad\text{and}\qquad\int_{\mathbb{R}}t^{k}K(t)\,dt=0\quad\text{for any }k\in\{1,\ldots,\ell\}.
\]
Then, for any bandwidth and any sequence of kernels , we define the following density estimator:
where, for $x\in[0,1]^d$, the "boundary" kernel is defined by:
for any $u\in\mathbb{R}^d$.
Note that, along each coordinate, the kernel is simply flipped according to the position of $x$ with respect to the closest boundary. Similar constructions can be found in the literature. For example, Korostelev and Nussbaum (1999) and Bertin (2004) proposed to decompose $[0,1]$ into three different pieces, which depend on the bandwidth: a left boundary region, a central region and a right boundary region. Specific kernels are used for the boundaries while classical kernels are used on the central region. However, to the best of our knowledge, similar constructions in a multivariate framework do not allow one to obtain adaptive results in the anisotropic case.
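A minimal sketch of the flipping mechanism in dimension 1 (our illustration: a simple nonnegative one-sided kernel supported on $[0,1]$, of order $0$ only, reflected when the estimation point lies in the right half; the kernels of the paper are higher-order variants):

```python
import numpy as np

def boundary_kde(x, X, h, K):
    """Boundary-kernel estimate on [0,1]: K has support in [0,1], and its
    argument is flipped when x lies closer to the right boundary, so all
    the kernel mass stays inside [0,1] (provided h <= 1/2)."""
    x = np.atleast_1d(np.asarray(x, float))
    out = np.empty_like(x)
    for j, xj in enumerate(x):
        u = (X - xj) / h if xj <= 0.5 else (xj - X) / h   # flip near right boundary
        out[j] = np.mean(K(u)) / h
    return out

K = lambda u: np.where((u >= 0) & (u <= 1), 2.0 * (1.0 - u), 0.0)  # density on [0,1]

rng = np.random.default_rng(2)
X = rng.uniform(size=200_000)
f0, f1 = boundary_kde([0.0, 1.0], X, 0.05, K)   # no endpoint bias when f = 1
```

Contrary to the symmetric convolution kernel, the estimate at both endpoints is now close to the true value $1$ for the uniform density.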
2.3 Bias over some functional classes
In this paper we focus on minimax rates of convergence over Sobolev-Slobodetskii classes. We recall their definitions in Definitions 1 and 2 (see also Simon, 1990; Opic and Rákosník, 1991; Triebel, 1995).
In the following, for $j\in\{1,\ldots,d\}$, any $k\in\mathbb{N}$ and any sufficiently smooth function $f$, we denote by $D_j^{k}f$ the $k$th-order partial derivative of $f$ with respect to the variable $x_j$. For any multi-index $r=(r_1,\ldots,r_d)\in\mathbb{N}^d$, we denote by $D^{r}f$ the mixed partial derivative $D_1^{r_1}\cdots D_d^{r_d}f$, where $|r|=r_1+\cdots+r_d$. Finally, for any positive number $s$, we denote by $\lfloor s\rfloor$ the largest integer strictly smaller than $s$.
Set and . A function , belongs to the anisotropic Sobolev-Slobodetskii ball if:
belongs to .
For any , exists and belongs to .
The following property holds:
Set and . A function , belongs to the isotropic Sobolev-Slobodetskii ball if the following properties hold:
for any $r\in\mathbb{N}^d$ such that $|r|\le\lfloor s\rfloor$, the mixed partial derivatives $D^{r}f$ exist and belong to $L_p([0,1]^d)$.
the Gagliardo semi-norm of each mixed partial derivative of order $\lfloor s\rfloor$ is bounded by $L$, where the Gagliardo semi-norm of a function $g$ is defined by
\[
|g|_{\sigma,p}=\left(\int_{[0,1]^d}\int_{[0,1]^d}\frac{|g(x)-g(y)|^{p}}{\|x-y\|^{d+\sigma p}}\,dx\,dy\right)^{1/p},\qquad \sigma=s-\lfloor s\rfloor,
\]
where $\|\cdot\|$ denotes the Euclidean norm of $\mathbb{R}^d$.
These classes include several classical classes of functions. Indeed, in the isotropic case, when $s$ is not an integer, the class corresponds to the usual Besov ball (see Triebel, 1995). Note that both definitions coincide when $d=1$.
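To make the Gagliardo semi-norm concrete, here is a Monte Carlo sketch in dimension $d=1$ (our illustration; `sigma` plays the role of the fractional part of the smoothness, and the test function is chosen so that the integrand is explicit):

```python
import numpy as np

def gagliardo(f, sigma, p, m=100_000, seed=3):
    """Monte Carlo approximation of the Gagliardo semi-norm on [0,1] (d = 1):
    ( double integral of |f(x) - f(y)|^p / |x - y|^(1 + sigma*p) dx dy )^(1/p)."""
    rng = np.random.default_rng(seed)
    x, y = rng.uniform(size=m), rng.uniform(size=m)
    ratio = np.abs(f(x) - f(y)) ** p / np.abs(x - y) ** (1.0 + sigma * p)
    return np.mean(ratio) ** (1.0 / p)

# For f(x) = x, sigma = 1/2 and p = 2 the integrand is identically 1,
# so the semi-norm equals 1.
val = gagliardo(lambda t: t, 0.5, 2)
```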
The following proposition illustrates the good properties, in terms of bias, of our boundary kernel estimators. It can be obtained following Proposition 4 and the proof of Proposition 5 given in Section 5.
Let and . Let and such that for all is of order . Then we have:
where is a positive constant that depends only on , , and .
Let and . Let and such that for all is of order . Then we have:
where is a positive constant that depends only on , , and .
As we will see in Section 4, our boundary kernel estimators, and the Goldenshluger-Lepski selection procedures based on them, also have good properties in terms of minimax and adaptive rates of convergence over these classes.
3 Statistical procedures
We defined in Section 2.2 a large family of kernel estimators that are well adapted to the estimation of densities with bounded support. Two subfamilies of estimators, designed for the estimation of isotropic or anisotropic functions, are now considered in Sections 3.2 and 3.3, and a single data-driven procedure is proposed in Section 3.4.
3.1 Family of bandwidth and kernels
We define the set of bandwidth vectors
with , and .
The family of bandwidths includes in particular, for $n$ large enough, all the bandwidths of the classical form $n^{-1/(2s+d)}$ with $s>0$. This family is then rich enough to attain all the optimal rates of convergence of the form $n^{-s/(2s+d)}$.
It is possible to have a weaker condition on choosing with a positive constant that depends on . For the sake of simplicity, we choose to use to avoid multiple cases in terms of .
We consider the family of kernels defined by:
where $\ell\in\mathbb{N}$ and $P_k$ is the Legendre polynomial of degree $k$ on $[0,1]$ (see Tsybakov, 2009). The kernels satisfy several properties given in the following lemma.
Set . The kernel is of order , satisfies and
where is the family of kernels of order . Moreover we have
where $H_\ell$ is the Hilbert matrix of order $\ell+1$.
Figure 1 represents the kernels for different values of $\ell$.
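A numerical sketch of such Legendre-based kernels, in the spirit of Tsybakov (2009, Section 1.2.2). We assume the form $K_\ell(u)=\sum_{k=0}^{\ell}\varphi_k(0)\varphi_k(u)$ with $\varphi_k$ the orthonormal shifted Legendre polynomials on $[0,1]$ (this specific construction is our illustration; the paper's (15) may differ in details). The moment checks confirm that such a kernel has order $\ell$:

```python
import numpy as np
from numpy.polynomial.legendre import legval, leggauss

def legendre_kernel(ell):
    """Order-ell kernel on [0,1]: K(u) = sum_k phi_k(0) phi_k(u), where
    phi_k(u) = sqrt(2k+1) P_k(2u-1) is orthonormal on [0,1].  The Legendre
    series coefficients are phi_k(0) * sqrt(2k+1) = (2k+1)(-1)^k."""
    coeffs = [(2 * k + 1) * (-1) ** k for k in range(ell + 1)]
    return lambda u: legval(2.0 * np.asarray(u) - 1.0, coeffs)

# Moment checks, by Gauss-Legendre quadrature transported to [0,1]:
nodes, weights = leggauss(50)
u, w = 0.5 * (nodes + 1.0), 0.5 * weights
K = legendre_kernel(3)
moments = [float(np.sum(w * u**j * K(u))) for j in range(4)]
# moments[0] = 1 and moments[1..3] = 0: the kernel has order 3.
```

Note that the sup-norm of these kernels grows with $\ell$, consistent with the discussion in Section 3.2.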
3.2 Isotropic family of estimators
For , we define:
where $\lfloor\cdot\rfloor$ stands for the integer part. For any $s>0$, we consider the corresponding kernel-bandwidth pair, where the univariate kernel is defined by (15).
We define the family of estimators where
The family contains kernel density estimators constructed with different kernels and bandwidths. Selecting an estimator in this family in fact consists in jointly and automatically selecting the order and the bandwidth of the estimator. The main idea that leads to this construction is the following: the estimator associated with a smoothness parameter $s$ is constructed using a kernel of order greater than $s$ and the usual bandwidth (that is, of the classical order) used to estimate functions with smoothness parameter $s$. The construction of such a class of estimators allows us to obtain adaptive estimators without any restriction on the smoothness parameter (see Theorem 4). However, arbitrary kernels of a given order cannot be used to prove Theorem 3 since a control of the norm of the kernels is required. In particular, in Lemma 2, we give bounds on the norms of the kernels and we prove that, within the kernels of order $\ell$, the chosen kernel has the smallest norm.
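The pairing between the order of the kernel and the bandwidth can be sketched as follows (an illustration with our own grid of candidate smoothness values; each pair uses a kernel of order at least $s$ together with the bandwidth of classical order $n^{-1/(2s+d)}$):

```python
import numpy as np

n, d = 100_000, 1
candidates = np.arange(0.5, 10.0, 0.5)        # candidate smoothness values s
family = [(int(np.floor(s)) + 1, n ** (-1.0 / (2 * s + d))) for s in candidates]
# Each element: (kernel order ell >= s, bandwidth of order n^{-1/(2s+d)}).
# Larger smoothness values pair a higher-order kernel with a larger bandwidth.
```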
Note that the simultaneous choice of kernel and bandwidth has already been used in the framework of sharp adaptive estimation, but only for pointwise and sup-norm risks. On the one hand, in the Gaussian white noise model, Bertin (2005) and Lepski and Spokoiny (1997) assume that the smoothness parameter is bounded. On the other hand, in the density model, Butucea (2001) considers the case of a finite grid of integer smoothness parameters and proposes an adaptive procedure for the pointwise risk over a scale of classical Sobolev classes. Note that in that paper, the maximal smoothness parameter may tend to infinity as $n$ goes to infinity. To our understanding this possibility relies on the fact that the kernels are uniformly bounded by a fixed constant. Studying the risk over classical Sobolev classes on $\mathbb{R}$ allows Butucea (2001)
to define the kernels on the Fourier domain and to replace the moment condition (5) by a weaker one (see Tsybakov, 2009, Section 1.3 for more details).
Our framework is very different. Indeed, we consider the estimation, in $L_p$ risks, of densities with compact support that belong to a scale of Sobolev-Slobodetskii classes indexed by a smoothness parameter. To do so we have to consider the classical moment condition (5), which implies, according to Lemma 2, that the sup-norm of any kernel of order $\ell$ tends to infinity with $\ell$. This requires a more technical control of the stochastic terms in order to obtain the minimax rates of convergence without an additional logarithmic factor.
3.3 Anisotropic family of estimators
Let be such that, for any , is a bounded kernel and consider defined by:
We define the anisotropic family of estimators where
To make the notation similar to the isotropic case we define . Note that this family of estimators is more classical than the one constructed in the previous section. All the estimators are defined using the same kernel and depend only on a multivariate bandwidth. Nevertheless, in the following (see Theorem 2), we will choose a kernel such that for all is of order and a possible candidate is .
3.4 Selection rule
Although the two families differ, the selection procedure is the same in both cases. For the sake of generality, we introduce the following notation: is either or and then denotes or . For , and we define:
where is the best known constant in the Rosenthal inequality (see Johnson et al., 1985). For any we consider:
where is the vector with coordinates . Now, for any we define:
where denotes the positive part of .
We then select
which leads to the final plug-in estimator defined by . In what follows we denote by and the resulting estimators.
This procedure is inspired by the method developed by Goldenshluger and Lepski. Here is linked with the bias term of the estimator , see (144), and
is an empirical version of an upper bound on the standard deviation of this estimator. In fact, for $p\le 2$, the standard deviation in $L_p$-norm of the estimator is bounded by a deterministic quantity, whereas for $p>2$ the bound depends on $f$ (see Lemma 3); that is the reason why we use an empirical version of this bound, defined in (22). The selected bandwidth thus realizes a trade-off between the bias proxy and the standard deviation proxy, which can be interpreted as an empirical counterpart of the classical trade-off between the bias and the standard deviation. Note that, as discussed in Lacour and Massart (2016), it is also possible to consider in (24) a different constant (satisfying an appropriate condition).
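A schematic, simplified sketch of a Goldenshluger-Lepski-type bandwidth selection in dimension 1 (our illustration: $L_2$ distances on a grid, and a crude deterministic majorant in place of the empirical one of the paper; the constants and the bandwidth grid are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
X = rng.beta(2, 5, size=n)                        # sample on [0,1]
grid = np.linspace(0.0, 1.0, 256)
H = [0.02 * 1.5 ** j for j in range(8)]           # bandwidth grid

def kde(h):
    u = (grid[:, None] - X[None, :]) / h
    return np.mean(0.75 * np.clip(1.0 - u**2, 0.0, None), axis=1) / h

def l2(g):                                        # L2([0,1]) norm, Riemann sum
    return float(np.sqrt(np.mean(g ** 2)))

M = {h: np.sqrt(1.0 / (n * h)) for h in H}        # crude standard-deviation majorant
est = {h: kde(h) for h in H}

# A(h): positive part of comparisons with the other estimators (bias proxy)
A = {h: max(max(l2(est[max(h, hp)] - est[hp]) - M[hp], 0.0) for hp in H) for h in H}
h_star = min(H, key=lambda h: A[h] + 2.0 * M[h])  # empirical bias/deviation trade-off
```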
4 Main results
In this section we present our results. Theorem 1 consists of an oracle-type inequality which guarantees that the anisotropic estimation procedure defined above performs almost as well as the best estimator in the collection. Moreover, Theorem 2 states that this procedure also achieves the minimax rate of convergence simultaneously over each anisotropic Sobolev-Slobodetskii class in a given scale.
Assume that is a density function such that . Then there exists a positive constant that depends only on , , , and , such that, for any :
Set , and . Assume that is such that is of order greater than or equal to . Then, our estimation procedure is such that:
Moreover, if $s$ is such that none of its coordinates is an integer, the following property is satisfied:
where the infimum is taken over all possible estimators.
Theorems 3 and 4 are the analogues of Theorems 1 and 2 respectively, transposed to the isotropic estimation procedure. Note however that the scale of functional classes considered in Theorem 4 is very large, since there is no restriction on the smoothness parameter $s$, contrary to classical results (including Theorem 2).
Assume that is a density function such that . Then there exists a positive constant that depends only on , , and , such that, for any :
Set and . We have:
and, if is not an integer,
where the infimum is taken again over all possible estimators.
Theorems 2 and 4 are established for scales of Sobolev-Slobodetskii classes. However similar results are still true if one replaces these classes with classical (an)isotropic Hölder classes. Remark also that the lower bounds are proved for non-integer smoothness parameters. As mentioned above, in this situation, the Sobolev-Slobodetskii classes correspond to usual Besov spaces.
In Theorems 1 and 3, the right-hand sides of the inequalities can be easily interpreted. In both situations, the first term is of the order of the standard deviation of the estimator, while the remaining terms are linked with its bias. More precisely, Propositions 4 and 5 ensure that these terms have the same behaviour as the bias term as soon as $f$ belongs to Sobolev-Slobodetskii classes.
Finally, to the best of our knowledge, even in the case of densities with support in $[0,1]$, adaptive results without restriction on the smoothness parameter, as in Theorem 4, are not known for either the Sobolev-Slobodetskii classes or the Hölder classes. This is not the case in Theorem 2, where the adaptive result is obtained only for smoothness parameters smaller than the orders of the kernel. The main difference between the isotropic case and the anisotropic case lies in the control of a quantity linked with the bias terms. In the isotropic case, if the orders of the kernels are large enough, these terms vanish and it remains to control
The study of (31) involves a Taylor expansion, and each estimator can be based on a different kernel. In the anisotropic case, (31) no longer holds, and the corresponding quantity can be expressed in terms of a difference of estimators at two different bandwidths (in order to use a Taylor expansion) only when both estimators are based on the same kernel.
The proofs of Theorems 1–4 are based on propositions and lemmas which are given below. Before stating these results, we introduce some notation that is used throughout the rest of the paper. For , and , we define the quantity:
For and we denote
The process is defined by
Finally, for we define (using the generic notation for the isotropic and the anisotropic cases):
Proposition 4 (Anisotropic case).
Set . Assume that is such that is of order greater than or equal to . Set and . Then, for any :
Proposition 5 (Isotropic case).
Set and . Then for any we have:
where the positive constant depends only on , , and .
Set . Assume that is such that .
Let . There exists a positive constant that depends only on , , and such that
Let . There exists a positive constant that depends only on , , , and such that