# Adaptive optimal kernel density estimation for directional data

We focus on the nonparametric density estimation problem with directional data. We propose a new rule for bandwidth selection for kernel density estimation. Our procedure is automatic, fully data-driven and adaptive to the smoothness degree of the density. We obtain an oracle inequality and optimal rates of convergence for the L2 error. Our theoretical results are illustrated with simulations.

07/30/2020


## 1 Introduction

Directional data arise in many fields: wind directions for the circular case; astrophysics, paleomagnetism and geology for the spherical case. Much effort has been devoted to devising statistical methods that tackle the density estimation problem. We refer to [Mardia and Jupp, 2000] and, more recently, to [Ley and Verdebout, 2017] for a comprehensive view. Nonparametric procedures are well developed. In this article we focus on kernel estimation, but we may also cite a series of works using projection methods on localized bases adapted to the sphere ([Baldi et al., 2009], [Kerkyacharian et al., 2011]). Classical references for kernel estimation with directional data include the seminal papers of [Hall et al., 1987] and [Bai et al., 1988]. It is well known that the choice of the bandwidth is a key and intricate issue when using kernel methods. Various techniques for selecting the bandwidth have been suggested since the popular cross-validation rule of [Hall et al., 1987]: among them, the plug-in and refined cross-validatory methods of [Taylor, 2008] and [Oliveira et al., 2012] for the circular case, and [Di Marzio et al., 2011] on the torus. More recently, [García-Portugués, 2013] devised an equivalent of the rule-of-thumb of [Silverman, 1986] for directional data, whereas [Amiri et al., 2017] explored computational issues with recursive kernel estimators based on the cross-validation procedure of [Hall et al., 1987]. But to the best of our knowledge, all the rules proposed so far for selecting the bandwidth are empirical: although they prove efficient in practice, little attention has been paid to their theoretical properties. [Klemelä, 2000] studied rates of convergence over some regularity classes for the kernel estimator of the density and its derivatives, but the asymptotically optimal bandwidth depends on the density and its smoothness degree, which makes it infeasible in applications.
In the present paper, we try to fill the gap between theory and practice. Our goal is two-fold. We aim at devising an automatic and fully data-driven choice of the kernel bandwidth such that the resulting estimator achieves optimal rates in the minimax sense for the risk over some regularity classes. We emphasize that the estimator is adaptive to the smoothness degree of the underlying density: the method does not require the specification of the regularity of the density. Our work is inspired by very recent techniques developed for the multivariate case. In the problem of multivariate kernel density estimation, adaptive minimax approaches have been tackled in a remarkable series of papers by [Lepskiĭ, 1990], [Lepskiĭ, 1991], [Goldenshluger and Lepski, 2008] and very recently by [Lacour et al., 2017]. Our methodology is inspired by the PCO (Penalized Comparison to Overfitting) procedure of [Lacour et al., 2017], which is based on concentration inequalities. Last but not least, our procedure is simple to implement and, in examples based on simulations, it shows quite good performance in a reasonable computation time.

This paper is organized as follows. In section 2, we present our estimation procedure. In section 3, we provide an oracle inequality and rates of convergence of our estimator for the MISE. Section 4 gives some numerical illustrations. Section 5 gives the proofs of the theorems. Finally, the Appendix gathers some technical lemmas.

Notations. For two integers , , we denote and . And denotes the largest integer smaller than such that . Depending on the context, denotes the classical -norm on or , and the associated scalar product. For a vector , stands for the Euclidean norm, and is the usual -norm. The scalar product of two vectors $x$ and $y$ is denoted $x^T y$, where $^T$ is the transpose operator.

## 2 Estimation procedure

We observe $n$ i.i.d. observations $X_1, \dots, X_n$ on the unit sphere $\mathbb{S}^{d-1}$ of $\mathbb{R}^d$. The $X_i$'s are absolutely continuous with respect to the Lebesgue measure $\omega_d$ on $\mathbb{S}^{d-1}$, with common density $f$. Therefore, a directional density satisfies

$$\int_{\mathbb{S}^{d-1}} f(x)\,\omega_d(dx) = 1.$$

We aim at constructing an adaptive kernel estimator of the density with a fully data-driven choice of the bandwidth.

### 2.1 Directional approximation kernel

We present some technical conditions that are required for the kernel.

Assumption 1 The kernel $K$ is a bounded and Riemann integrable function on $[0,\infty)$ such that

$$0 < \int_0^\infty x^{(d-3)/2} K(x)\,dx < \infty.$$

Assumption 1 is classical in kernel density estimation with directional data, see for instance Assumptions D1-D3 in [García-Portugués et al., 2013] and Assumption A1 in [Amiri et al., 2017]. An example of a kernel which satisfies Assumption 1 is the popular von Mises kernel $K(x) = e^{-x}$.

Now, following [Klemelä, 2000], we shall define what is called a kernel of class $s$. Let

$$\alpha_i(K) = \int_0^\infty x^{(i+d-3)/2} K(x)\,dx, \qquad i \in \mathbb{N}.$$

Assumption 2 Let $s$ be even. The kernel $K$ is of class $s$, i.e. it is a measurable function which satisfies:

• $\alpha_i(K) = 0$ for $0 < i < s$,

• $\alpha_s(K) \neq 0$,

• $x^{(s+d-3)/2} K(x) \to 0$ when $x$ tends to $\infty$.

### 2.2 Family of directional kernel estimators

We consider the following classical directional kernel density estimator

$$\hat f_h(x) = \frac{c_0(h)}{n}\sum_{i=1}^n K\Big(\frac{1 - x^T X_i}{h^2}\Big) = \frac{c_0(h)}{n}\sum_{i=1}^n K_{h^2}(x, X_i), \qquad x \in \mathbb{S}^{d-1},$$

where $K$ is a kernel satisfying Assumption 1 and $c_0(h)$ is a normalizing constant such that $\hat f_h$ integrates to unity:

$$c_0^{-1}(h) := \int_{\mathbb{S}^{d-1}} K_{h^2}(x,y)\,\omega_d(dy). \tag{2.1}$$

It remains to select a convenient value for $h$.
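To fix ideas, $\hat f_h$ and the constant $c_0(h)$ can be computed numerically. The sketch below is our own illustration, not the paper's implementation: it treats $d = 3$ with the von Mises kernel $K(r) = e^{-r}$, and evaluates (2.1) through the one-dimensional reduction detailed in Remark 1 below. Function names and grid sizes are illustrative.

```python
import numpy as np

def K(r):
    # von Mises kernel K(r) = exp(-r), which satisfies Assumption 1
    return np.exp(-r)

def c0(h, d=3, m=20001):
    # Normalizing constant (2.1). By rotational invariance, the surface integral
    # reduces (Remark 1) to
    #   c0(h)^{-1} = sigma_{d-2} * int_{-1}^{1} K((1-t)/h^2) (1-t^2)^{(d-3)/2} dt.
    # Valid for d >= 3 (for d = 2 the weight is singular at t = +-1).
    from math import pi, gamma
    sigma = 2 * pi ** ((d - 1) / 2) / gamma((d - 1) / 2)  # area of S^{d-2}
    t = np.linspace(-1.0, 1.0, m)
    g = K((1.0 - t) / h ** 2) * (1.0 - t ** 2) ** ((d - 3) / 2)
    integral = np.sum((g[1:] + g[:-1]) / 2 * np.diff(t))  # trapezoidal rule
    return 1.0 / (sigma * integral)

def f_hat(x, X, h):
    # \hat f_h(x) = c0(h)/n * sum_i K((1 - x^T X_i)/h^2); samples X as rows
    return c0(h, d=X.shape[1]) / len(X) * np.sum(K((1.0 - X @ x) / h ** 2))
```

For $d = 3$ the quadrature can be checked against the closed form $c_0^{-1}(h) = 2\pi h^2(1 - e^{-2/h^2})$.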

### 2.3 Bandwidth selection

In kernel density estimation, a delicate step consists in selecting the proper bandwidth $h$ for $\hat f_h$. We suggest the following data-driven choice of bandwidth, inspired from [Lacour et al., 2017]. We name our procedure SPCO (Spherical Penalized Comparison to Overfitting). We denote $h_{\min} := \min \mathcal{H}$, with $\mathcal{H}$ the bandwidth set defined in (2.4). Our selection rule is the following:

$$\hat h := \operatorname*{argmin}_{h \in \mathcal{H}}\Big\{\|\hat f_h - \hat f_{h_{\min}}\|^2 + \mathrm{pen}_\lambda(h)\Big\}, \qquad \lambda \in \mathbb{R}, \tag{2.2}$$

where

$$\mathrm{pen}_\lambda(h) := \lambda\frac{c_0^2(h)\,c_2(h)}{n} - \frac{1}{n}\int_{\mathbb{S}^{d-1}}\big(c_0(h_{\min})K_{h_{\min}^2}(x,y) - c_0(h)K_{h^2}(x,y)\big)^2\,\omega_d(dy), \tag{2.3}$$

$$c_2(h) := \int_{\mathbb{S}^{d-1}} K^2_{h^2}(x,y)\,\omega_d(dy),$$

and $\mathcal{H}$ is a set of bandwidths defined by

$$\mathcal{H} = \bigg\{h : \Big(\frac{\|K\|_\infty}{n}\,\frac{1}{R_0(K)}\Big)^{\frac{1}{d-1}} \le h \le 1, \text{ and } h^{-1}\text{ is an integer}\bigg\}, \tag{2.4}$$

with $R_0(K)$ a constant depending only on the kernel $K$ and $d$, and where $\sigma_{d-1}$ denotes the area of $\mathbb{S}^{d-1}$. We recall that $\sigma_{d-1} = 2\pi^{d/2}/\Gamma(d/2)$, with $\Gamma$ the Gamma function.

The estimator of $f$ is $\hat f_{\hat h}$.

The procedure SPCO involves a real parameter $\lambda$. In section 3 we study how to choose the optimal value of $\lambda$, leading to a fully data-driven procedure.
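As an illustration, the SPCO rule (2.2) can be implemented directly on $\mathbb{S}^2$ ($d = 3$) for the von Mises kernel, for which $c_0(h)$ and $c_2(h)$ have closed forms. The quadrature grids and helper names below are our own choices, a sketch rather than the paper's implementation.

```python
import numpy as np

# Closed forms for the von Mises kernel K(r) = e^{-r} on S^2 (d = 3):
def c0(h):   # normalizing constant (2.1)
    return 1.0 / (2 * np.pi * h ** 2 * (1 - np.exp(-2 / h ** 2)))

def c2(h):   # c2(h) = sigma_1 * int_{-1}^{1} K^2((1-t)/h^2) dt
    return np.pi * h ** 2 * (1 - np.exp(-4 / h ** 2))

def sphere_grid(m=50):
    # midpoint quadrature on S^2 in spherical coordinates (weight sin(theta))
    th = (np.arange(m) + 0.5) * np.pi / m
    ph = (np.arange(2 * m) + 0.5) * np.pi / m
    T, P = np.meshgrid(th, ph, indexing="ij")
    pts = np.stack([np.sin(T) * np.cos(P), np.sin(T) * np.sin(P), np.cos(T)], -1)
    return pts.reshape(-1, 3), (np.sin(T) * (np.pi / m) ** 2).reshape(-1)

def f_hat(pts, X, h):
    # directional kernel estimator evaluated at the grid points
    return c0(h) / len(X) * np.exp(-(1 - pts @ X.T) / h ** 2).sum(axis=1)

def pen(h, hmin, n, lam, m=4001):
    # pen_lambda(h) of (2.3), via the 1-D reduction given in Remark 1
    t = np.linspace(-1.0, 1.0, m)
    g = (c0(hmin) * np.exp(-(1 - t) / hmin ** 2)
         - c0(h) * np.exp(-(1 - t) / h ** 2)) ** 2
    bias_term = 2 * np.pi * np.sum((g[1:] + g[:-1]) / 2 * np.diff(t))
    return lam * c0(h) ** 2 * c2(h) / n - bias_term / n

def spco(X, H, lam=1.0):
    # selection rule (2.2): minimize ||f_hat_h - f_hat_hmin||^2 + pen_lambda(h)
    pts, w = sphere_grid()
    hmin = min(H)
    f_min = f_hat(pts, X, hmin)
    crit = [np.sum(w * (f_hat(pts, X, h) - f_min) ** 2) + pen(h, hmin, len(X), lam)
            for h in H]
    return H[int(np.argmin(crit))]
```

The default $\lambda = 1$ anticipates the optimal tuning established in section 3.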

###### Remark 1.

Note that $c_0(h)$, $c_2(h)$ and $\mathrm{pen}_\lambda(h)$ do not depend on $x$. Indeed, it is known (see [Hall et al., 1987]) that if $y$ is a vector of $\mathbb{S}^{d-1}$ and $x$ a fixed element of $\mathbb{S}^{d-1}$, then, denoting $t = x^T y$ their scalar product, we may always write

$$y = t x + (1 - t^2)^{1/2}\,\xi,$$

where $\xi$ is a unit vector orthogonal to $x$. Further, the area element on $\mathbb{S}^{d-1}$ can be written as

$$\omega_d(dx) = (1 - t^2)^{\frac{d-3}{2}}\,dt\,\omega_{d-1}(d\xi).$$

Thus, using these conventions, one obtains

$$\begin{aligned} c_0^{-1}(h) &= \int_{\mathbb{S}^{d-1}} K\Big(\frac{1 - x^T y}{h^2}\Big)\,\omega_d(dy) \\ &= \int_{\mathbb{S}^{d-2}}\int_{-1}^{1} K\Big(\frac{1 - x^T(tx + (1-t^2)^{1/2}\xi)}{h^2}\Big)(1 - t^2)^{\frac{d-3}{2}}\,dt\,\omega_{d-1}(d\xi) \\ &= \int_{\mathbb{S}^{d-2}}\omega_{d-1}(d\xi)\int_{-1}^{1} K\Big(\frac{1 - t\,x^T x}{h^2}\Big)(1 - t^2)^{\frac{d-3}{2}}\,dt \\ &= \sigma_{d-2}\int_{-1}^{1} K\Big(\frac{1 - t}{h^2}\Big)(1 - t^2)^{\frac{d-3}{2}}\,dt. \end{aligned}$$

Using similar computations, one gets that

$$c_2(h) = \sigma_{d-2}\int_{-1}^{1} K^2\Big(\frac{1-t}{h^2}\Big)(1 - t^2)^{(d-3)/2}\,dt,$$

and

$$\mathrm{pen}_\lambda(h) = \lambda\frac{c_0^2(h)c_2(h)}{n} - \frac{\sigma_{d-2}}{n}\int_{-1}^{1}\Big(c_0(h_{\min})K\Big(\frac{1-t}{h_{\min}^2}\Big) - c_0(h)K\Big(\frac{1-t}{h^2}\Big)\Big)^2(1 - t^2)^{(d-3)/2}\,dt.$$
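These reductions are easy to check numerically. The following sketch (our own, under the assumptions $d = 3$ and von Mises kernel) compares the direct surface integral of (2.1) with its one-dimensional reduction.

```python
import numpy as np

def vmises(r):
    return np.exp(-r)   # von Mises kernel

def c0_inv_surface(h, m=400):
    # Direct surface integral of (2.1) over S^2, midpoint rule in (theta, phi);
    # by rotational invariance we may fix x at the north pole, so x^T y = cos(theta).
    th = (np.arange(m) + 0.5) * np.pi / m
    ph = (np.arange(2 * m) + 0.5) * np.pi / m
    T, P = np.meshgrid(th, ph, indexing="ij")
    w = np.sin(T) * (np.pi / m) ** 2            # area element sin(theta) dtheta dphi
    return np.sum(vmises((1 - np.cos(T)) / h ** 2) * w)

def c0_inv_reduced(h, m=20001):
    # 1-D reduction of Remark 1: c0(h)^{-1} = sigma_1 int_{-1}^{1} K((1-t)/h^2) dt,
    # with sigma_1 = 2*pi and a trivial weight since (d-3)/2 = 0 for d = 3
    t = np.linspace(-1.0, 1.0, m)
    g = vmises((1.0 - t) / h ** 2)
    return 2 * np.pi * np.sum((g[1:] + g[:-1]) / 2 * np.diff(t))
```

Both agree with the closed form $2\pi h^2(1 - e^{-2/h^2})$ available for this kernel.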

## 3 Rates of convergence

### 3.1 Oracle inequality

First, we state an oracle-type inequality which highlights the bias-variance decomposition of the risk. $|\mathcal{H}|$ denotes the cardinality of $\mathcal{H}$. We recall that we denote $h_{\min} := \min \mathcal{H}$.

###### Theorem 1.

Assume that the kernel $K$ satisfies Assumption 1 and that $\|f\|_\infty < \infty$. Let $x \ge 1$ and $\varepsilon \in (0,1)$. Then there exists $n_0$, independent of $f$, such that for $n \ge n_0$, with probability larger than $1 - c\,|\mathcal{H}|\,e^{-x}$ for some positive constant $c$,

$$\|\hat f_{\hat h} - f\|^2 \le C_0(\varepsilon,\lambda)\min_{h\in\mathcal{H}}\|\hat f_h - f\|^2 + C_2(\varepsilon,\lambda)\|f_{h_{\min}} - f\|^2 + C_3(\varepsilon,K,\lambda)\Big(\|f\|_\infty\frac{x^2}{n} + c_0(h_{\min})\frac{x^3}{n^2}\Big), \tag{3.1}$$

where $c$ is an absolute constant; the constants $C_0(\varepsilon,\lambda)$ and $C_2(\varepsilon,\lambda)$ only depend on $\varepsilon$ and $\lambda$, and $C_3(\varepsilon,K,\lambda)$ only depends on $\varepsilon$, $K$ and $\lambda$.

This oracle inequality bounds the quadratic risk of the SPCO estimator by the infimum over $h \in \mathcal{H}$ of the tradeoff between the approximation term and the variance term; the last terms of (3.1) are remainder terms. Hence, this oracle inequality justifies our selection rule. For further details about oracle inequalities and model selection, see [Massart, 2007].

The next theorem shows that we cannot choose $\lambda$ too small, at the risk of selecting a bandwidth close to $h_{\min}$ with high probability. This would lead to an overfitting estimator. To this purpose, we suppose

$$\frac{n\,\|f - f_{h_{\min}}\|^2}{c_0^2(h_{\min})\,c_2(h_{\min})} = o(1). \tag{3.2}$$

This assumption is quite mild. Indeed, the variance of $\hat f_{h_{\min}}$ is of order $c_0^2(h_{\min})c_2(h_{\min})/n$, thus this assumption means that the smallest bias is negligible with respect to the corresponding integrated variance. Last but not least, because $c_0^2(h_{\min})c_2(h_{\min})$ is of order $h_{\min}^{1-d}$ (see Lemma 3), this assumption amounts to

$$\|f - f_{h_{\min}}\|^2 = o\Big(\frac{h_{\min}^{1-d}}{n}\Big).$$
###### Theorem 2.

Assume that the kernel $K$ satisfies Assumption 1 and that $\|f\|_\infty < \infty$. Assume also (3.2) and

$$\frac{\|K\|_\infty}{n}\,\frac{1}{R_0(K)} \le h_{\min}^{d-1} \le \frac{(\log n)^\beta}{n}, \qquad \beta > 0.$$

Then, if we consider $\mathrm{pen}_\lambda$ defined in (2.3) with $\lambda < 0$, we have for $n$ large enough, with high probability:

$$\hat h \le C(\lambda)\,h_{\min} \le C(\lambda)\Big(\frac{(\log n)^\beta}{n}\Big)^{\frac{1}{d-1}},$$

where $C(\lambda)$ is a constant depending only on $\lambda$.

###### Remark 2.

Theorem 2 invites us to discard negative values of $\lambda$. Now, considering oracle inequality (3.1), $\lambda = 1$ yields the minimal value of the leading constant $C_0(\varepsilon,\lambda)$. Thus, the theory urges us to take the optimal value $\lambda = 1$ in the SPCO procedure. Actually, we will see in the numerical section that the choice $\lambda = 1$ is quite efficient.

### 3.2 Tuning-free estimator and rates of convergence

Results of Section 3.1 about the optimality of $\lambda = 1$ enable us to devise our tuning-free estimator $\hat f_{\check h}$, with bandwidth $\check h$ defined as follows:

$$\check h := \operatorname*{argmin}_{h\in\mathcal{H}}\Big\{\|\hat f_h - \hat f_{h_{\min}}\|^2 + \mathrm{pen}(h)\Big\} \tag{3.3}$$

and

$$\mathrm{pen}(h) := \frac{c_0^2(h)c_2(h)}{n} - \frac{1}{n}\int_{\mathbb{S}^{d-1}}\big(c_0(h_{\min})K_{h_{\min}^2}(x,y) - c_0(h)K_{h^2}(x,y)\big)^2\,\omega_d(dy), \tag{3.4}$$

where $\mathcal{H}$ is the set of bandwidths defined by

$$\mathcal{H} = \bigg\{h : \Big(\frac{\|K\|_\infty}{n}\,\frac{1}{R_0(K)}\Big)^{\frac{1}{d-1}} \le h \le 1, \text{ and } h^{-1}\text{ is an integer}\bigg\}. \tag{3.5}$$

The following corollary of Theorem 1 states an oracle inequality satisfied by $\hat f_{\check h}$. It will be central to compute rates of convergence.

###### Corollary 1.

Assume that the kernel $K$ satisfies Assumption 1 and that $\|f\|_\infty < \infty$. Let $x \ge 1$ and $\varepsilon \in (0,1)$. Then there exists $n_0$, independent of $f$, such that for $n \ge n_0$, with probability larger than $1 - c\,|\mathcal{H}|\,e^{-x}$ for some positive constant $c$,

$$\|\hat f_{\check h} - f\|^2 \le C_0(\varepsilon)\min_{h\in\mathcal{H}}\|\hat f_h - f\|^2 + C_2(\varepsilon)\|f_{h_{\min}} - f\|^2 + C_3(\varepsilon,K)\Big(\|f\|_\infty\frac{x^2}{n} + c_0(h_{\min})\frac{x^3}{n^2}\Big), \tag{3.6}$$

where $c$ is an absolute constant; the constants $C_0(\varepsilon)$ and $C_2(\varepsilon)$ only depend on $\varepsilon$, and $C_3(\varepsilon,K)$ only depends on $\varepsilon$ and $K$.

We now compute rates of convergence for the MISE (Mean Integrated Squared Error) of our estimator over some smoothness classes. [Klemelä, 2000] defined suitable smoothness classes for the study of the MISE. In particular, these regularity classes involve a concept of ”average” of directional derivatives, which was first defined in [Hall et al., 1987]. Let us recall the definition of these smoothness classes.

Let $\eta \in \mathbb{S}^{d-1}$, $\xi \in T_\eta$ and $\theta \in \mathbb{R}$, where $T_\eta$ denotes the set of unit vectors orthogonal to $\eta$. Let $\phi_\eta$ be a parameterization of $\mathbb{S}^{d-1}$ defined by

$$\phi_\eta^{-1}(\xi,\theta) = \eta\cos(\theta) + \xi\sin(\theta).$$

When $f : \mathbb{S}^{d-1} \to \mathbb{R}$ and $\xi \in T_x$, define the derivative of $f$ at $x$ in the direction of $\xi$ to be $D_\xi f(x) = \frac{\partial}{\partial\theta} f\big(\phi_x^{-1}(\xi,\theta)\big)\big|_{\theta=0}$, and $D^s_\xi f = D_\xi\big(D^{s-1}_\xi f\big)$ for $s \ge 2$ an integer.

We shall now define the derivative of order $s$.

###### Definition 1.

Let $\bar f$ be defined by $\bar f(x) = f(x/\|x\|)$. The derivative of order $s$ is defined by

$$D^s f(x) = \frac{1}{\sigma_{d-1}}\int_{T_x} D^s_{\xi}\,\bar f(x)\,\omega_d(d\xi),$$

where $T_x = \{\xi \in \mathbb{S}^{d-1} : \xi^T x = 0\}$.

###### Definition 2.

For $\theta \in \mathbb{R}$, define

$$\tilde D^s f(x,\theta) = \frac{1}{\sigma_{d-1}}\int_{T_x} D^s_{\phi_x^{-1}(\xi,\theta+\frac{\pi}{2})}\,\bar f\big(\phi_x^{-1}(\xi,\theta)\big)\,\omega_d(d\xi).$$

We are now able to define the smoothness class (see [Klemelä, 2000]).

###### Definition 3.

Let $s$ be even and $B > 0$. Let $\mathcal{F}(s,B)$ be the set of functions $f$ such that, for all $x \in \mathbb{S}^{d-1}$ and all $\theta$, $\tilde D^s f(x,\theta)$ is continuous as a function of $\theta$ and bounded by $B$.

An application of the oracle inequality of Corollary 1 allows us to derive rates of convergence for the MISE of $\hat f_{\check h}$.

###### Theorem 3.

Consider a kernel $K$ satisfying Assumption 1 and Assumption 2. For $s$ even and $B > 0$, let us denote by $\tilde{\mathcal{F}}(s,B)$ the set of densities bounded by $B$ and belonging to $\mathcal{F}(s,B)$. Then we have

$$\sup_{f\in\tilde{\mathcal{F}}(s,B)}\mathbb{E}\big[\|\hat f_{\check h} - f\|^2\big] \le C(s,K,d,B)\,n^{-\frac{2s}{2s+(d-1)}},$$

with $C(s,K,d,B)$ a constant depending on $s$, $K$, $d$ and $B$.

Theorem 3 shows that the estimator $\hat f_{\check h}$ achieves the optimal rate of convergence in dimension $d-1$ (minimax rates for multivariate density estimation are studied in [Ibragimov and Khasminski, 1980], [Ibragimov and Khasminski, 1981], [Hasminskii and Ibragimov, 1990]). Furthermore, our statistical procedure is adaptive to the smoothness $s$: it does not require the specification of $s$.

## 4 Numerical results

We investigate the numerical performance of our fully data-driven estimator $\hat f_{\check h}$ defined in section 3.2 with simulations. We compare it to the widely used cross-validation estimator and to the ”oracle” (that will be defined later on). We focus on the unit sphere $\mathbb{S}^2$ (i.e. the case $d = 3$).

We aim at estimating the von Mises-Fisher density $f_{1,vM}$ (see Figure 1):

$$f_{1,vM}(x) = \frac{\kappa}{2\pi(e^{\kappa} - e^{-\kappa})}\,e^{\kappa x^T\mu},$$

We recall that $\kappa$ is the concentration parameter and $\mu$ the directional mean. We also estimate the mixture of two von Mises-Fisher densities $f_{2,vM}$:

$$f_{2,vM}(x) = \frac{4}{5}\times\frac{\kappa}{2\pi(e^{\kappa}-e^{-\kappa})}\,e^{\kappa x^T\mu} + \frac{1}{5}\times\frac{\kappa'}{2\pi(e^{\kappa'}-e^{-\kappa'})}\,e^{\kappa' x^T\mu'},$$

Note that the smaller the concentration parameter is, the closer the von Mises-Fisher density is to the uniform density.
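For the simulations, von Mises-Fisher densities can be evaluated and sampled with the standard tangent-normal construction, since the latitude $t = \mu^T x$ has an explicit inverse CDF. The sketch below is ours; it works on $\mathbb{S}^2$ with the pole $\mu = e_3$, and the parameter values in the usage are illustrative (the paper's values are not reproduced here).

```python
import numpy as np

def vmf_density(x, mu, kappa):
    # von Mises-Fisher density on S^2: kappa / (2*pi*(e^k - e^-k)) * exp(kappa x.mu)
    cst = kappa / (2 * np.pi * (np.exp(kappa) - np.exp(-kappa)))
    return cst * np.exp(kappa * x @ mu)

def vmf_sample(n, kappa, rng):
    # Tangent-normal sampler with mean direction mu = e3 = (0, 0, 1):
    # t = mu.x has density prop. to e^{kappa t} on [-1, 1], whose inverse CDF is
    #   t = 1 + log(u + (1-u) e^{-2 kappa}) / kappa,  u ~ Uniform(0, 1);
    # the tangential part is uniform on the circle. For a general mu, rotate e3 to mu.
    u = rng.random(n)
    t = 1 + np.log(u + (1 - u) * np.exp(-2 * kappa)) / kappa
    phi = 2 * np.pi * rng.random(n)
    r = np.sqrt(np.clip(1 - t ** 2, 0, None))
    return np.stack([r * np.cos(phi), r * np.sin(phi), t], axis=1)
```

A quick sanity check: the density integrates to one over $\mathbb{S}^2$, and the sample mean of $t$ is close to $\coth(\kappa) - 1/\kappa$.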

Now let us define what the ”oracle” is. The oracle bandwidth $h_{oracle}$ is defined as

$$h_{oracle} = \operatorname*{argmin}_{h\in\mathcal{H}}\|\hat f_h - f\|^2.$$

$h_{oracle}$ can be viewed as the ”ideal” bandwidth, since it uses the specification of the density of interest, which is here either $f_{1,vM}$ or $f_{2,vM}$. Hence, the performances of $\hat f_{h_{oracle}}$ are used as a benchmark.
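When the true density is known, as in a simulation study, $h_{oracle}$ can be approximated by minimizing a quadrature estimate of $\|\hat f_h - f\|^2$ over the grid $\mathcal{H}$. The sketch below is our own, for $d = 3$ and the von Mises kernel; grid sizes and names are illustrative.

```python
import numpy as np

def c0(h):  # normalizing constant (2.1), von Mises kernel, d = 3 (closed form)
    return 1.0 / (2 * np.pi * h ** 2 * (1 - np.exp(-2 / h ** 2)))

def sphere_grid(m=60):
    # midpoint quadrature on S^2 (spherical coordinates, weight sin(theta))
    th = (np.arange(m) + 0.5) * np.pi / m
    ph = (np.arange(2 * m) + 0.5) * np.pi / m
    T, P = np.meshgrid(th, ph, indexing="ij")
    pts = np.stack([np.sin(T) * np.cos(P), np.sin(T) * np.sin(P), np.cos(T)], -1)
    return pts.reshape(-1, 3), (np.sin(T) * (np.pi / m) ** 2).reshape(-1)

def f_hat(pts, X, h):
    # directional kernel estimator evaluated at the grid points
    return c0(h) / len(X) * np.exp(-(1 - pts @ X.T) / h ** 2).sum(axis=1)

def h_oracle(X, H, f_true):
    # argmin over H of the quadrature estimate of ||f_hat_h - f||^2 (f known)
    pts, w = sphere_grid()
    f = f_true(pts)
    errs = [np.sum(w * (f_hat(pts, X, h) - f) ** 2) for h in H]
    return H[int(np.argmin(errs))]
```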

In the sequel, we present detailed results for $f_{1,vM}$, namely risk curves and graphic reconstructions, and we compute the MISE for both $f_{1,vM}$ and $f_{2,vM}$. We use the von Mises kernel $K(x) = e^{-x}$.

Before presenting the performances of the various procedures, we shall recall that the theoretical results of section 3.1 have shown that setting $\lambda = 1$ in the SPCO algorithm was optimal. We would like to show how simulations actually support this conclusion. Indeed, Figure 2 displays the $L_2$-risk of $\hat f_{\hat h}$ as a function of the parameter $\lambda$. Figure 2 a/ shows a ”dimension jump”: the minimal risk is reached in a stable zone around $\lambda = 1$; negative values of $\lambda$ lead to an overfitting estimator ($h_{\min}$ is chosen) with an explosion of the risk, whereas large values of $\lambda$ make the risk increase again (see a zoom on Figure 2 b/). Next, we will see that $\lambda = 1$ yields quite good results.

In Lemma 4 of the Appendix, we clarify the expression (3.3) to be minimized to implement our estimator $\hat f_{\check h}$. We now recall the cross-validation criterion of [Hall et al., 1987]. Let

$$\hat f_{h,i}(x) = \frac{c_0(h)}{n-1}\sum_{j\ne i} e^{-(1 - x^T X_j)/h^2},$$

then

$$CV_2(h) = \|\hat f_h\|^2 - \frac{2}{n}\sum_{i=1}^n \hat f_{h,i}(X_i).$$

$CV_2(h)$ is an unbiased estimate, up to a constant, of the MISE of $\hat f_h$. The cross-validation procedure to select the bandwidth consists in minimizing $CV_2(h)$ with respect to $h$. We call this selected value $h_{CV}$.
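The criterion $CV_2$ is cheap to evaluate on $\mathbb{S}^2$ for the von Mises kernel, because $\|\hat f_h\|^2$ admits a closed form through the identity $\int_{\mathbb{S}^2} e^{\kappa\,x^T m}\,\omega_d(dx) = 4\pi\sinh(\kappa)/\kappa$. This identity and the implementation below are our addition, a sketch rather than the paper's code.

```python
import numpy as np

def cv2(X, h):
    # CV_2(h) = ||f_hat_h||^2 - (2/n) sum_i f_hat_{h,i}(X_i); von Mises kernel, d = 3
    n = len(X)
    c0 = 1.0 / (2 * np.pi * h ** 2 * (1 - np.exp(-2 / h ** 2)))  # closed form of (2.1)
    S = X @ X.T                                   # pairwise scalar products X_i^T X_j
    # ||f_hat_h||^2 via  int_{S^2} exp(k x.m) w(dx) = 4*pi*sinh(k)/k  with
    # k_ij = |X_i + X_j|/h^2 and |X_i + X_j| = sqrt(2 + 2 X_i^T X_j);
    # written with exp(k - 2/h^2) (k <= 2/h^2) to avoid overflow of sinh
    k = np.sqrt(np.maximum(2 + 2 * S, 1e-300)) / h ** 2
    norm2 = (c0 / n) ** 2 * np.sum(
        4 * np.pi * (np.exp(k - 2 / h ** 2) - np.exp(-k - 2 / h ** 2)) / (2 * k))
    G = np.exp(-(1 - S) / h ** 2)                 # kernel matrix, diagonal = K(0) = 1
    loo = c0 / (n - 1) * (G.sum(axis=1) - np.diag(G))   # leave-one-out f_hat_{h,i}(X_i)
    return norm2 - 2 * np.mean(loo)
```

Minimizing `cv2` over the bandwidth grid then gives $h_{CV}$; as expected, very small bandwidths inflate the criterion.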

In the rest of this section, CV will denote the estimation procedure related to $h_{CV}$. In Figure 3, for $f_{1,vM}$ we plot, as a function of $h$, the criterion minimized by each method: the quadratic error for the oracle, the SPCO criterion, and $CV_2$ for cross-validation. We point out on each graphic the value of $h$ that minimizes each quantity. In Figure 4, we plot in spherical coordinates, for $f_{1,vM}$, the true density and the density reconstructions for the oracle, SPCO and cross-validation. Eventually, in Tables 1 and 2, we compute the MISE when estimating $f_{1,vM}$ and $f_{2,vM}$ for the oracle, SPCO and cross-validation, for two sample sizes, over Monte-Carlo runs.

When analyzing the results, SPCO shows quite satisfying performance. Indeed, SPCO is close to the oracle and slightly better than cross-validation when looking at the MISE computations for both densities.

## 5 Proofs

Before proving Theorem 1, we need several intermediate results. The first one is the following proposition, which is the counterpart of Proposition 4.1 of [Lerasle et al., 2016] for directional data.

Let

$$R_1(K) := 2^{(d-3)/2}\,\sigma_{d-2}\int_0^\infty x^{(d-3)/2}K^2(x)\,dx. \tag{5.1}$$
###### Proposition 4.

Assume that the kernel $K$ satisfies Assumption 1. Let $x \ge 1$. There exists $n_0$ ($n_0$ not depending on $f$) such that for $n \ge n_0$, all $\eta \in (0,1)$ and all $h \in \mathcal{H}$, with high probability, each of the following inequalities holds:

$$\|f - \hat f_h\|^2 \le (1+\eta)\Big(\|f - f_h\|^2 + \frac{c_0^2(h)c_2(h)}{n}\Big) + \square\,\frac{\Upsilon x^2}{\eta^3 n}, \tag{5.2}$$

$$\|f - f_h\|^2 + \frac{c_0^2(h)c_2(h)}{n} \le (1+\eta)\,\|f - \hat f_h\|^2 + \square\,\frac{\Upsilon x^2}{\eta^3 n}. \tag{5.3}$$

Proof of Proposition 4.

To prove Proposition 4, we need to verify Assumptions (11)-(16) of [Lerasle et al., 2016]. We recall that

$$f_h = \mathbb{E}\big(\hat f_h\big) = c_0(h)\int_{y\in\mathbb{S}^{d-1}} f(y)\,K_{h^2}(x,y)\,\omega_d(dy).$$

Let us check Assumption (11) of [Lerasle et al., 2016]. This one amounts to proving that, for some constants $\Gamma$ and $\Upsilon$,

$$\Gamma\,(1 + \|f\|_\infty) \vee \sup_{h\in\mathcal{H}}\|f_h\|^2 \le \Upsilon.$$

We have

$$\begin{aligned} \|f_h\|^2 &\le \|f\|_\infty\int_x\underbrace{\Big(\int_y c_0(h)K_{h^2}(x,y)\,\omega_d(dy)\Big)}_{=1}\Big(\int_y f(y)\,c_0(h)K_{h^2}(x,y)\,\omega_d(dy)\Big)\,\omega_d(dx) \\ &\le \|f\|_\infty\int_y f(y)\int_x c_0(h)K_{h^2}(x,y)\,\omega_d(dx)\,\omega_d(dy) \\ &\le \|f\|_\infty, \end{aligned}$$

hence Assumption (11) in [Lerasle et al., 2016] holds with and .

Let us check Assumption (12) of [Lerasle et al., 2016]. We have to prove that

$$\int c_0^2(h)K^2_{h^2}(x,x)\,\omega_d(dx) \le \Upsilon\,n\int\!\!\int c_0^2(h)K^2_{h^2}(x,y)\,\omega_d(dx)\,f(y)\,\omega_d(dy).$$

But since

$$\int\!\!\int c_0^2(h)K^2_{h^2}(x,y)\,\omega_d(dx)\,f(y)\,\omega_d(dy) = c_0^2(h)c_2(h)\int f(y)\,\omega_d(dy) = c_0^2(h)c_2(h),$$

and

$$\int c_0^2(h)K^2_{h^2}(x,x)\,\omega_d(dx) = 4\pi\,c_0^2(h)K^2(0),$$

Assumption (12) amounts to check that

$$4\pi\,c_0^2(h)K^2(0) \le \Upsilon\,n\,c_0^2(h)c_2(h) \iff \Upsilon \ge \frac{4\pi K^2(0)}{n\,c_2(h)}.$$

But using (6.2), we have

$$c_2(h) = R_1(K)\,h^{d-1} + o(1),$$

when tends to uniformly in . Thus there exists , independent of , such that for , . Now using that and , it is sufficient to have to ensure Assumption (12) in [Lerasle et al., 2016].

Assumption (13) in [Lerasle et al., 2016] consists in proving that

$$\|f_h - f_{h'}\|_\infty \le \Upsilon \vee \sqrt{\Upsilon n}\,\|f_h - f_{h'}\|_2.$$

For any $h$ and any $h'$ in $\mathcal{H}$, we have

$$\|f_h\|_\infty \le \|f\|_\infty,$$

therefore Assumption (13) in [Lerasle et al., 2016] holds for .

Assumptions (14) and (15) of [Lerasle et al., 2016] consist in proving respectively that

$$\mathbb{E}\Big[c_0^2(h)\int K_{h^2}(X,z)K_{h^2}(z,Y)\,\omega_d(dz)\Big]^2 \le \Upsilon\,c_0^2(h)c_2(h)$$

and

$$\sup_{x\in\mathbb{S}^{d-1}}\mathbb{E}\Big[c_0^2(h)\int K_{h^2}(X,z)K_{h^2}(z,x)\,\omega_d(dz)\Big]^2 \le \Upsilon\,n.$$

We have

$$c_0^2(h)\int K_{h^2}(x,z)K_{h^2}(z,y)\,\omega_d(dz) \le c_0^2(h)c_2(h) \wedge c_0(h)\|K\|_\infty.$$

Indeed, the first bound follows from the Cauchy-Schwarz inequality; for the second,

$$c_0^2(h)\int K_{h^2}(x,z)K_{h^2}(z,y)\,\omega_d(dz) \le c_0(h)\|K\|_\infty\underbrace{c_0(h)\int K_{h^2}(x,z)\,\omega_d(dz)}_{=1} = c_0(h)\|K\|_\infty.$$

Furthermore, (6.1) entails that there exists $n_0$, independent of $f$, such that for $n \ge n_0$, $c_0(h) \le 2R_0(K)^{-1}h^{1-d}$, and consequently $c_0(h)\|K\|_\infty \le 2n$, using (3.5). Thus, for $n \ge n_0$,

$$c_0^2(h)\int K_{h^2}(x,z)K_{h^2}(z,y)\,\omega_d(dz) \le c_0^2(h)c_2(h) \wedge 2n. \tag{5.4}$$

We have

$$\begin{aligned} \mathbb{E}\Big[c_0^2(h)\int_z K_{h^2}(X,z)K_{h^2}(z,x)\,\omega_d(dz)\Big] &= c_0^2(h)\int_z\Big(\int_y K_{h^2}(y,z)f(y)\,\omega_d(dy)\Big)K_{h^2}(z,x)\,\omega_d(dz) \\ &\le \|f\|_\infty\,c_0(h)\int_z\underbrace{c_0(h)\int_y K_{h^2}(y,z)\,\omega_d(dy)}_{=1}\,K_{h^2}(z,x)\,\omega_d(dz) \\ &\le \|f\|_\infty. \end{aligned} \tag{5.5}$$

Therefore for

 supx∈Sd−1E[c20(h)