Directional data arise in many fields such as wind direction for the circular case, astrophysics, paleomagnetism, geology for the spherical case. Many efforts have been put to devise statistical methods to tackle the density estimation problem. We refer to [Mardia and Jupp, 2000] and more recently to [Ley and Verdebout, 2017] for a comprehensive view. Nonparametric procedures have been well developed. In this article we focus on kernel estimation but we may cite a series of works using projection methods on localized bases adapted to the sphere ([Baldi et al., 2009], [Kerkyacharian et al., 2011]). Classical references for kernel estimation with directional data include the seminal papers of [Hall et al., 1987] and [Bai et al., 1988]. It is well-known that the choice of the bandwidth is a key and intricate issue when using kernel methods. Various techniques for selecting the bandwidth have been suggested since the popular cross-validation rule by [Hall et al., 1987]. We shall cite plug-in and refined cross-validatory methods in [Taylor, 2008] and [Oliveira et al., 2012] for the circular case, [Di Marzio et al., 2011] on the torus. More recently, [García-Portugués, 2013] devised an equivalent of the rule-of-thumb of [Silverman, 1986] for directional data, whereas [Amiri et al., 2017] explored computational problems with recursive kernel estimators based on the cross-validation procedure of [Hall et al., 1987]. But to the best of our knowledge, all the rules proposed so far for selecting the bandwidth are empirical. Although they prove efficient in practice, only little attention has been put on their theoretical properties. [Klemelä, 2000] studied convergence rates for error over some regularity classes for the kernel estimator for the estimation of the density and its derivatives but the asymptotically optimal bandwidth depends on the density and its smoothness degree which is unfeasible for applications. In the present paper, we try to fill the gap between theory and practice. Our goal is two-fold. We aim at devising an automatic and fully data driven choice of the kernel bandwidth so that the resulted estimator achieves optimal rates in the minimax sense for the risk over some regularity classes. We emphasize that the estimator is adaptive to the smoothness degree of the underlying density. It means that the method does not require the specification of the regularity of the density. Our work is inspired by very recent techniques developped for the multivariate case. In the problem of multivariate kernel density estimation, adaptive minimax approaches have been tackled in a remarkable series of papers by [Lepskiĭ, 1990], [Lepskiĭ, 1991], [Goldenshluger and Lepski, 2008] and very recently by [Lacour et al., 2017]. Our methodology is inspired from the PCO (Penalized Comparison of Overfiting) procedure of [Lacour et al., 2017]. It is based on concentration inequalities for statistics. Last but not least, our procedure is simple to be implemented and in examples based on simulations, it shows quite good performances in a reasonable computation time.
This paper is organized as follows. In section 2, we present our estimation procedure. In section 3 we provide an oracle inequality and rates of convergences of our estimator for the MISE. Section 4 gives some numerical illustrations. Section 5 gives the proofs of theorems. Finally the Appendix gathers some technical lemmas.
Notations For two integers , , we denote and . And denotes the largest integer smaller than such that .
Depending on the context, denotes the classical -norm on or ,
the associated scalar product. For a vector, stands for the Euclidian norm on . And is the usual -norm on .
The scalar product of two vectors and , is denoted by , where is the transpose operator.
2 Estimation procedure
We observe i.i.d observations on the unit sphere of , . The ’s are absolutely continuous with respect to the Lebesgue measure on with common density . Therefore, a directional density satisfies
We aim at constructing an adaptive kernel estimator of the density with a fully data-driven choice of the bandwidth.
2.1 Directional approximation kernel
We present some technical conditions that are required for the kernel.
Assumption 1 The kernel is a bounded and Riemann integrable function such that
Assumption 1 is classical in kernel density estimation with directional data, see for instance Assumptions D1-D3 in [García-Portugués et al., 2013] and Assumption A1 in [Amiri et al., 2017]. An example of kernel which satisfies Assumption 1 is the popular von-Mises kernel .
Now following [Klemelä, 2000] we shall define what is called a kernel of class . Let
Assumption 2 Let be even. The kernel is of class i.e it is a measurable function which satisfies :
for , when tends to .
2.2 Family of directional kernel estimators
We consider the following classical directional kernel density estimator
where is a kernel satisfying Assumption 1 and a normalizing constant such that integrates to unity:
It remains to select a convenient value for .
2.3 Bandwidth selection
In kernel density estimation, a delicate step consists in selecting the proper bandwith for . We suggest the following data-driven choice of bandwith inspired from [Lacour et al., 2017]. We name our procedure SPCO (Spherical Penalized Comparison to Overfitting). We denote Our selection rule is the following:
and a set of bandwiths defined by
with and denotes the area of . We recall that with the Gamma function.
The estimator of is .
The procedure SPCO involves a real parameter . In section 3 we study how to choose the optimal value of leading to a fully data driven procedure.
Note that , and do not depend on . Indeed its is known (see [Hall et al., 1987]) that if is a vector and a fixed element of , then denoting their scalar product, we may always write
where is a unit vector orthogonal to . Further, the area element on can be written as
Thus, using these conventions, one obtains
Using similar computations, one gets that
3 Rates of convergence
3.1 Oracle inequality
First, we state an oracle type inequality which highlights the bias-variance decomposition of therisk. denotes the cardinality of . We recall that we denote .
Assume that kernel satisfies Assumption 1 and . Let and .
Then there exists independent of , such that for with probability larger than
with probability larger than,
where is an absolute constant and if , if . The constant only depends on and and only depends on , and .
This oracle inequality bounds the quadratic risk of SPCO estimator by the infimum over of the tradeoff between the approximation term and the variance term . The terms are remaining terms. Hence, this oracle inequality justifies our selection rule. For further details about oracle inequalities and model selection see [Massart, 2007].
The next theorem shows that we cannot choose too small () at the risk of selecting a bandwidth close to with high probability. This would lead to an overfitting estimator. To this purpose, we suppose
This assumption is quite mild. Indeed the variance of is of order , thus this assumption means that the smallest bias is negligible with respect to the corresponding integrated variance. Last but not least, because is of order (see Lemme 3), this assumption amounts to
3.2 Tuning-free estimator and rates of convergence
Results of Section 3.1 about the optimality of enable us to devise our tuning-free estimator with bandwidth defined as follows
and where is a set of bandwiths defined by
The following corollary of Theorem 1 states an oracle inequality satisfied by . It will be central to compute rates of convergence.
Assume that kernel satisfies Assumption 1 and . Let and . Then there exists independent of , such that for with probability larger than ,
where is an absolute constant and . The constant only depends on and only depends on and .
We now compute rates of convergence for the MISE (Mean Integrated Square Error) of our estimator over some smoothness classes. [Klemelä, 2000] defined suitable smoothness classes for the study of the MISE. In particular, theses regularity classes involve a concept of an ”average” of directional derivatives which was first defined in [Hall et al., 1987]. Let us recall the definition of these smoothness classes.
Let , and . Let be a parameterization of defined by
When and , define the derivative of at in the direction of to be and for an integer.
We now shall define the derivative of order .
Let define by . The derivative of order is defined by
where , .
When define by
We are now able to define the smoothness class (see [Klemelä, 2000]).
Let be even and . Let be the set of such functions that for for all and for all , is continuous as a function of , is bounded for and .
An application of the oracle inequality in Corollary 1 allows us to derive rates of convergence for the MISE of .
Consider a kernel satisfying Assumption 1 and Assumption 2. For , let us denote the set of densities bounded by and belonging to . Then we have
with a constant depending on , , and .
Theorem 3 shows that the estimator achieves the optimal rate of convergence in dimension (minimax rates for multivariate density estimation are studied in [Ibragimov and Khasminski, 1980], [Ibragimov and Khasminski, 1981], [Hasminskii and Ibragimov, 1990]). Furthermore, our statistical procedure is adaptive to the smoothness . It means that it does not require the specification of .
4 Numerical results
We investigate the numerical performances of our fully data-driven estimator defined in section 3.2 with simulations. We compare to the widely used cross-validation estimator and to the ”oracle” (that will be defined later on). We focus on the unit sphere (i.e the case ).
We aim at estimating the von-Mises Fisher density (see Figure 1):
with and . We recall that is the concentration parameter and the directional mean. We also estimate the mixture of two von-Mises Fisher densities, :
with and . Note that the smaller the concentration parameter is, the closer to the uniform density the von-Mises Fisher density is.
Now let us define what the ”oracle” is. The bandwidth is defined as
can be viewed as the ”ideal” bandwidth since it uses the specification of the density of interest which is here either or . Hence, the performances of are used as a benchmark.
In the sequel we present detailed results for , namely risk curves and graphic reconstructions and we compute MISE both for and . We use the von-Mises kernel .
Before presenting the performances of the various procedures, we shall remind that theoretical results of section 3.1 have shown that setting in the SPCO algorithm was optimal. We would like to show how simulations actually support this conclusion. Indeed, Figure 2 displays the -risk of in function of parameter . Figure 2 a/ shows a ”dimension jump” and that the minimal risk is reached in a stable zone around : negative values of lead to an overfitting estimator ( is chosen) with an explosion of the risk, whereas large values of make the risk increase again (see a zoom on Figure 2 b/). Next, we will realize that yields quite good results.
is an unbiased estimate of the MISE of. The cross-validation procedure to select the bandwidth consists in minimizing with respect to . We call this selected value .
Vertical red lines represent the bandwidth value minimizing each curve.
c/ cross-validation ,
In the rest of this section, will denote the estimation procedure related to . In Figure 3, for we plot as a function of : for the oracle, for SPCO and for cross-validation. We point out on each graphic the value of that minimizes each quantity. In Figure 4, we plot in spherical coordinates, for , the density and density reconstructions for the oracle, SPCO and cross-validation. Eventually, in Tables 1 and 2, we compute MISE to estimate and for the oracle, SPCO and cross-validation for and , over Monte-Carlo runs.
When analyzing the results, SPCO shows quite satisfying performances. Indeed, SPCO is close to the oracle and is slightly better than cross-validation when looking at the MISE computations for both densities.
Assume that kernel satisfies Assumption 1. Let . There exists , such that for ( not depending on ), all and for all with probability larger than , for all each of the following inequalities holds
Proof of Proposition 4.
Let us check Assumption (11) of [Lerasle et al., 2016]. This one amounts to prove that for some and
hence Assumption (11) in [Lerasle et al., 2016] holds with and .
Let us check Assumption (12) of [Lerasle et al., 2016]. We have to prove that
Assumption (12) amounts to check that
But using (6.2), we have
when tends to uniformly in . Thus there exists , independent of , such that for , . Now using that and , it is sufficient to have to ensure Assumption (12) in [Lerasle et al., 2016].
Assumption (13) in [Lerasle et al., 2016] consists to prove that
For any and any , we have
therefore Assumption (13) in [Lerasle et al., 2016] holds for .
Assumptions (14) and (15) of [Lerasle et al., 2016] consist in proving respectively that
Indeed if , then , otherwise