# Bandwidth Selection for the Wolverton-Wagner Estimator

For n independent random variables having the same Hölder continuous density, this paper deals with controls of the Wolverton-Wagner's estimator MSE and MISE. Then, for a bandwidth h_n(β), estimators of β are obtained by a Goldenshluger-Lepski type method and a Lacour-Massart-Rivoirard type method. Some numerical experiments are provided for this last method.

## Authors

• 6 publications
• 11 publications
05/14/2021

### Nadaraya-Watson Estimator for I.I.D. Paths of Diffusion Processes

This paper deals with a nonparametric Nadaraya-Watson estimator b of the...
06/13/2020

### Kernel Selection in Nonparametric Regression

In the regression model Y = b(X) +ε, where X has a density f, this paper...
01/26/2020

### On a Nadaraya-Watson Estimator with Two Bandwidths

In a regression model, we write the Nadaraya-Watson estimator of the reg...
12/02/2017

### Central limit theorem for the variable bandwidth kernel density estimators

In this paper we study the ideal variable bandwidth kernel density estim...
01/12/2021

### Variable bandwidth kernel regression estimation

In this paper we propose a variable bandwidth kernel regression estimato...
01/01/2018

### Scalable Hash-Based Estimation of Divergence Measures

We propose a scalable divergence estimation method based on hashing. Con...
12/20/2012

### SMML estimators for 1-dimensional continuous data

A method is given for calculating the strict minimum message length (SMM...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1. Introduction

Consider independent random variables

having the same probability distribution of density

with respect to Lebesgue’s measure.

The usual Parzen [11] - Rosenblatt [12] kernel estimator of is defined by

where and is a kernel. In 1969, Wolverton and Wagner introduced in [15] a variant of defined by

 (1) ˆfn,hn(x):=1nn∑k=11hkK(Xk−xhk),

where and . Thanks to its recursive form, this type of estimator is well-suited to online treatment of data: by denoting ,

 ˆfn+1,hn+1(x)=nn+1ˆfn,hn(x)+1(n+1)hn+1K(Xn+1−xhn+1).

Thus, up-dating the estimator when new observations are available is easy and fast.

We can mention here that several variants or generalizations of the Wolverton and Wagner (WW) estimator have been proposed: see Yamato [16], Wegman and Davies [14], Hall and Patil [5]. They were studied from almost sure convergence point of view, or asymptotic rates of convergence under fixed regularity assumptions. We choose to focus on Wolverton and Wagner estimator but our results and discussions may be applied to these.

Theoretical developments concerning either classical Parzen-Rosenblatt or WW recursive kernels estimators occurred recently following different and independent roads.
On the one hand, several recent works are dedicated to efficient and data-driven bandwidth selection, see Goldenshluger and Lespki [4] and several companion papers by these authors, or Lacour et al. [8] who proposed a modification of the method. The original Goldenshluger and Lepski (GL) method was difficult to implement because it turned out to be numerically consuming and with calibration difficulties, see Comte and Rebafka [3]. This is why the improvement proposed in Lacour et al. [8] has both theoretical and practical interest.
On the other hand, the increase of computer speed and of data sets sizes made fast up-dating of estimators mandatory. The theoretical developments in this context are in the field of stochastic algorithms (see e.g. Mokkadem et al. [10]) or in view of specific applications (see Bercu et al. [2]).

Bandwidths have to be chosen for WW estimators as for Parzen-Rosenblatt ones, and this choice is crucial to obtain good performances. This is why we propose to extend to this context general risk study as described in Tsybakov [13] and the GL method as improved by Lacour et al. [8]. More precisely, considering for instance for a parameter in formula (1), we study adaptive selection of . We prove risk bounds for the Mean Integrated Squares Error (MISE) of the resulting estimator where and .
Amiri [1] proved that for with regularity 2 and an adequate choice of the bandwidth, Parzen-Rosenblatt’s estimator had asymptotical smaller risk than the WW estimator. We propose an empirical finite sample study of this question, together with an interesting insight on the gain brought by higher order kernels.
Now, clearly, plugging in the estimator makes the recursivity fail. Therefore, a mixed strategy is required with initial estimation of on the first -sample and recursive up-dating relying on this "freezed" value on the following -sample, where should have the same order as . This is what is experimented in our final section, and empirically proved to be an appropriate strategy.

This paper provides in Section 2 controls of the MSE and of the MISE of the estimator under general regularity conditions on . Then, in Section 3, the well-known Goldenshluger-Lepski’s bandwidth selection method for Parzen-Rosenblatt’s estimator is extended to Wolverton-Wagner’s estimator. Lastly, an estimator in the spirit of Lacour et al. [8] is studied from both theoretical and practical point of view in Section 4. Concluding remarks present a mixed strategy in Section 5. Proofs are relegated in Section 6.

Notations:

1. Consider . The space of -Hölder continuous functions from into itself is denoted by and equipped with the -Hölder semi-norm defined by

 ∥φ∥α:=supx,y∈R:x≠y|φ(y)−φ(x)||y−x|α ; ∀φ∈Cα(R).
2. For and ,

 Σ(β):={φ∈Cl(R):∥φ(l)∥β−l<∞}.
3. For every square integrable functions ,

 (f∗g)(x):=∫∞−∞f(x−y)g(y)dy ; x∈R.
4. for every .

## 2. Bounds on the MSE and the MISE of Wolverton-Wagner’s estimator

Consider and . Throughout this section, the map fulfills the following assumption.

###### Assumption 2.1.

The map is integrable for every ,

 ∫∞−∞K(y)dy=1,∫∞−∞K2(y)dy<+∞ and ∫∞−∞yiK(y)dy=0,∀i∈⟦1,l⟧.

Let us establish a control of the MSE of Wolverton-Wagner’s estimator under the following condition on .

###### Assumption 2.2.

The map belongs to .

###### Proposition 2.3.

Under Assumptions 2.1 and 2.2, there exists a constant , not depending on and , such that

 E(|ˆfn,hn(x)−f(x)|2)⩽cn2⎛⎜⎝∣∣ ∣∣n∑k=1hβkl!∣∣ ∣∣2+n∑k=11hk⎞⎟⎠.

Now, let us establish a control of the MISE of Wolverton-Wagner’s estimator under Nikolski’s condition on .

###### Assumption 2.4.

The map belongs to and there exists a constant such that

 (∫∞−∞|f(l)(y+ε)−f(l)(y)|2dy)1/2⩽N(f)|ε|β−l ; ∀ε∈R.
###### Proposition 2.5.

Under Assumptions 2.1 and 2.4, there exists a constant , not depending on and , such that

 ∫∞−∞E(|ˆfn,hn(x)−f(x)|2)dx⩽cn2⎛⎜⎝∣∣ ∣∣n∑k=1hβk(l−1)!∣∣ ∣∣2+n∑k=11hk⎞⎟⎠.

Remark. Assumptions 2.1, 2.2 and 2.4 are standard for density estimation, see Tsybakov (2009). Moreover, if we set , we recover the results stated in Section 1.2.1 for Proposition 2.3 and in Theorem 1.3 for Proposition 2.5 in Tsybakov (2009) (that is a squared bias term of order

and a variance term of order

).

The estimator is consistent if the risk tends to zero when grows to infinity, that is if

 (2) Bn:=1n2∣∣ ∣∣n∑k=1hβk∣∣ ∣∣2 and Vn:=1n2n∑k=11hk

satisfy

Let us consider

 hk=k−γ ; k∈⟦1,n⟧

with (otherwise or cannot tend to zero). Then

 (3) Bn=O(1n2) if γβ>1%andBn=O(1n2γβ) if γβ<1

with the intermediate case

 Bn=O(log(n)n2) if γβ=1.

Indeed, if , then

 Bn=∣∣ ∣∣1nn∑k=1k−γβ∣∣ ∣∣2=n−2γβ∣∣ ∣∣1nn∑k=1(kn)−γβ∣∣ ∣∣2∼n−2γβ(1−γβ)2.

On the other hand,

 (4) Vn=O(nγ−1).

As a consequence, we have the following result:

###### Corollary 2.6.

Under Assumptions 2.1 and 2.4, choosing

 hk=k−γ ; k∈⟦1,n⟧,

with

 γ=12β+1

yields the rate

 ∫∞−∞E(|ˆfn,hn(x)−f(x)|2)dx⩽cn−2β2β+1,

where is a positive constant which does not depend on .

Clearly, this is the optimal rate in the minimax sense, see Goldenshluger and Lepski [4] and the references therein.

###### Proof.

Consider

 φn(γ):=n−2γβ+nγ−1.

Then,

 ∂φn(γ)∂γ=log(n)(−2βe−2γβlog(n)+e(γ−1)log(n)).

Moreover, if and only if,

 γ=12β+1+log(2β)log(n)(1+2β)∼12β+1.

Therefore, makes the upper bound on the risk minimal. ∎

## 3. Goldenshluger-Lepski’s method for Wolverton-Wagner’s estimator

This section provides an extension of the well-known Goldenshluger-Lepski’s bandwidth selection method for Parzen-Rosenblatt’s estimator to Wolverton-Wagner’s estimator.

Throughout this section, assume that

 hk=hk(γ) ; ∀k∈⟦1,n⟧,

where and the maps from into fulfill the following assumption.

###### Assumption 3.1.

For every ,

 0

Moreover, is decreasing and one to one from into .

For instance, one can take as above for every and .

Consider

 hn(γ):=(h1(γ),…,hn(γ))

and the set , where and

 0<γ1<⋯<γN(n)⩽h−1n(1/n).

Consider also

 ˆfn,γ,γ′(x):=1nn∑k=1(Khk(γ′)∗Khk(γ))(Xk−x),

where .

A way to extend the Goldenshluger-Lepski bandwidth selection method to Wolverton-Wagner’s estimator is to solve the minimization problem

 (5) minγ∈Γn(An(γ)+Vn(γ)),

where

 An(γ):=supγ′∈Γn(∥ˆfn,hn(γ′)−ˆfn,γ,γ′∥22−Vn(γ′))+,Vn(γ′):=υnhn(γ′)

with not depending on and

 1hn(γ′):=1nn∑k=11hk(γ′).

In the sequel, the map fulfills the following assumption.

###### Assumption 3.2.

For every and ,

 supn∈N∗∑γ′∈Γnexp(−c/hn(γ′)r)<∞.

Example. Consider

 hk(γ′)=k−γ′ ; ∀k∈⟦1,n⟧, ∀γ′∈[0,1]

and

 (6) Γn=⎧⎨⎩(ilog(n))1/2 ; i∈⟦1,[log(n)]⟧⎫⎬⎭.

For every ,

 1hn(γ′) = 1nn∑k=11hk(γ′)=nγ′−1n∑k=1(kn)γ′ ⩾ nγ′−2n∑k=1k⩾nγ′2⩾12exp(log(n)1/2).

Then, for any and ,

###### Proposition 3.3.

Under Assumptions 3.1 and 3.2, if is bounded and is a solution of the minimization problem (5), then there exists a constant , not depending on , such that

 E(∥ˆfn,hn(ˆγn)−f∥22)⩽c⎧⎨⎩infγ∈Γn⎛⎝Vn(γ)+1n2∣∣ ∣∣n∑k=1∥f−Khk(γ)∗f∥2∣∣ ∣∣2⎞⎠+1n⎫⎬⎭.

If in addition Assumptions 2.1 and 2.4 hold, then

 (7) E(∥ˆfn,hn(ˆγn)−f∥22)⩽c{infγ∈Γn(Bn(γ)+Vn(γ))+1n}

where and are defined in (2), (3) and (4).

Remark. By Corollary 2.6, the infimum in bound (7

) has the order of the optimal rate, and is reached automatically by the data driven estimator. This result is more precise than the heuristics associated with cross-validation.

We mentioned previously that the optimal theoretical choice for under Assumptions 2.1 and 2.4 is . Here, the selected should be at nearest of this value, e.g. if is as in (6), distant from less than of the good choice. We may therefore consider that provides an estimate of and thus an estimate of the regularity of (at least for huge values of ).

## 4. The Lacour-Massart-Rivoirard (LMR) estimator

### 4.1. Estimator and main result

The Goldenshluger-Lepski method has been acknowledged as being difficult to implement, due to the square grid in required to compute intermediate versions of the criterion and to the lack of intuition in the choice of the constant which should be calibrated from preliminary simulation experiments. This is the reason why Lacour et al. [8] investigated and proposed a simplified criterion relying on deviation inequalities for -statistics due to Houdré and Reynaud-Bouret [6]. This inequality applies in our more complicated context and Lacour-Massart-Rivoirard’s result can be extended here as follows.

Let us recall that for every and set

 fn,γ(x):=E(ˆfhn(γ)(x))=1nn∑k=1(Khk(γ)∗f)(x).

Let be the maximal proposal in and consider

 Crit(γ):=∥ˆfn,hn(γ)−ˆfn,hn(γmax)∥22+pen(γ)

with

 pen(γ):=2n2n∑k=1⟨Khk(γmax),Khk(γ)⟩2.

Then, we define

 ˜γn∈argminγ∈ΓnCrit(γ).

In the sequel, , and fulfill the following assumption.

###### Assumption 4.1.

The kernel is symmetric, ,

 ∫∞−∞K(y)dy=1, ∥K∥∞∥K∥1nhn(γmax)⩽1

and .

###### Proposition 4.2.

Consider and . Under Assumption 4.1, there exists three deterministic constants , not depending on , and , such that with probability larger than ,

 ∥ˆfn,hn(˜γn)−f∥22 ⩽ +c2ε∥fn,γmax−f∥22+c3ε(λ2n+λ3n2hn(γmax)).

Remark. The term is negligible because it is a pure bias term for smallest bandwidth (e.g., under Assumption 2.4, it has order , see (3), and thus if is near of 1 and ). The terms following are of order and are always negligible compared to nonparametric rates in our setting. Therefore, the bound given in Proposition 4.2 says that the MISE of the adaptive estimator has the order of the best estimator of the collection, up to a multplicative factor larger than 1. This is the method we implement in the next section: it is faster than GL method and with no constant to calibrate in the penalty.

### 4.2. Simulation experiments

We consider basic densities with different types and orders of regularity:

• , density ,

• a mixed gaussian , density ,

• , density ,

• a mixed beta , density ,

• , density ,

• a mixed gamma , density ,

• with , a Laplace density.

The densities and have infinite regularity, and should rather have regularity of order less than 2, and less than 4, and less than 1. This choice should allow to study the influence of the order of the kernel.

Denoting by the density of a centered Gaussian random variable with variance equal to , we consider the following kernels:

• a Gaussian kernel, which is of order 1,

• a Gaussian-type kernel of order 3, ,

• a Gaussian-type kernel of order 5, ,

• a Gaussian-type kernel of order 7, .

With all these kernels, the penalty terms are computed analytically and without approximation. Indeed, for , it holds that

 ⟨ni,h1,nj,h2⟩2=∫∞−∞ni,h1(x)nj,h2(x)dx=1√2π×1√ih21+jh22.

We compute the variable bandwidth estimator as described in Section 4 and select in a collection of equispaced values between 0 and 0.5 while the bandwidth associated with observation is . We also compute the original estimator of Lacour et al. [8] with bandwidth which does not depend on the observation and is selected among values in the set .

For comparison, we give the performance of the Matlab density estimator obtained from function (denoted by ks in Table 1), which entails a different bandwidth selection method and relies on a gaussian kernel.

We compute the integrated -risk associated with all the final estimators, evaluated at equispaced points in the range of the observations, averaged over repetitions:

 1KK∑j=1b−aPP∑ℓ=1(ˆf(j)ˆh(j)(xℓ)−f(xℓ))2, xℓ=a+ℓb−aP,

where is the estimator computed for path .

Results are gathered in Table 1 and deserve some comments. As expected, when increasing from 250 to 1000, the resulting MSEs decrease and seem to be more improved in LMR methods of both types than for ks estimator. Increasing the order of the kernel systematically improves the results, except for the lowest regularity density , which is at best with , but it is interesting to note that taking higher order kernel is always a good strategy: if a loss occurs, it is negligible while the improvement, when it happens, is in all cases significant. Estimator ks fails for all mixed densities , and and provides rather bad results in these cases, for both sample sizes. For the other densities (, , , ), the results obtained with kernel and LMR method are better than with ks for (Gaussian case), and of comparable order in all other cases. Now if we compare the LMR and WW-LMR results both with kernel , we conclude that the WW-LMR method wins in 10 cases out of 14, but not significantly.

The first line of Figure 1 illustrates in the left picture the way Matlab estimator fails for mixed densities (here the mixed Gaussian ) by probably selecting a too large bandwidth, here . The two LMR estimators are almost confounded. The middle and right pictures present the estimators among which the LMR procedure makes the selection, for the same path: we observe that the collection of proposals are rather different. The second line of Figure 1 presents the same type of results for density , and sample size . Figure 2 shows beams of 30 final estimators for sample size , for the three estimators LMR-WW with , LMR with and ks, showing very similar behaviours.

A last remark corresponding to numerical results we do not report in detail is the following. For most densities, the value of selected by the LMR strategy decreases, and the value of increases, when the order of the kernel increases. Exceptions are densities with lower regularity (the beta , mixed beta and Laplace densities) for which the last value of selected with is less than the one selected with . This illustrates the fact that, asymptotically, if is the regularity index of the density and the order of the kernel, the optimal choice is for of order and for , .

## 5. Concluding remarks

Our study illustrates that bandwidth selection is an important step for kernel functional estimation, and recent methods are really powerful whatever the type of density to recover.
Our simulations show also that, even if it implies non necessarily nonnegative kernels and thus density estimators, increasing the order of the kernel improves the estimation both in the theory and in practice. Also, we proved that variable bandwidth for WW-type estimators can reach excellent rates, again both in theory and in practice, provided that adaptive choice of this variable bandwidth is performed. The orders of practical MISEs show that this WW-strategy provides results of the same order as the more classical bandwidth methods.

However, one may wonder how to keep these ideas compatible with recursive procedures and online updating of the kernel estimator. We believe that the adaptive bandwidth, whatever its type, can be selected on a preliminary sample and then, "freezed" to this selected value and plugged in the estimator. We tested this strategy on simulations: for each sample, another independent sample is generated, and a second estimator is computed based on the new sample, with the value of or selected for the first data set. Table 2 provides the MISEs obtained for the estimator with adaptive bandwidth with kernel for WW-estimator (column ) or for NW-estimator (column ), for comparison with MISEs of the estimator computed on an independent sample with the values based on the previous selections (columns "Fixed " and "Fixed "). Surprisingly, the results have exactly the same orders (we expected a slight deterioration): this proves that the proposal of preliminary selection is really fine.

## 6. Proofs

### 6.1. Proof of Proposition 2.3

First, by the bias-variance decomposition,

 (8) E(|ˆfn,hn(x)−f(x)|2)=bn(f,x)2+var(ˆfn,hn(x))

where

 bn(f,x):=E(ˆfn,hn(x))−f(x).

Let us find controls for and .

On the one hand,

 var(ˆfn,hn(x)) = ⩽ 1n2n∑k=11h2k∫∞−∞K(y−xhk)2f(y)dy = 1n2n∑k=11hk∫∞−∞K(z)2f(hkz+x)dz ⩽ c1n2n∑k=11hk,

where On the other hand,

 bn(f,x) = −f(x)+1nn∑k=11hkE(K(Xk−xhk)) = −f(x)+1nn∑k=1∫∞−∞K(z)f(hkz+x)dz = 1nn∑k=1∫∞−∞K(z)(f(hkz+x)−f(x))dz.

For every and , by Taylor-Lagrange’s formula there exists such that

 f(hkz+x)−f(x)=l−1∑i=1(hkz)ii!f(i)(x)+(hkz)ll!f(l)(τhkz+x).

Then, by Assumption 2.1,

 bn(f,x) = 1nn∑k=1∫∞−∞K(z)(f(hkz+x)−f(x))dz = 1nn∑k=1(l−1∑i=1hiki!f(i)(x)∫∞−∞ziK(z)dz+hlkl!∫∞−∞zlK(z)f(l)(τhkz+x)dz) = 1nn∑k=1hlkl!∫∞−∞zlK(z)f(l)(τhkz+x)dz = 1nn∑k=1hlkl!∫∞−∞zlK(z)(f(l)(τhkz+x)−f(l)(x))dz.

Therefore, by Assumption 2.2,

 |bn(f,x)| ⩽ 1nn∑k=1hlkl!∫∞−∞zl⋅|K(z)|⋅|f(l)(τhkz+x)−f(l)(x)|dz ⩽ c2nn∑k=1hβkl!

where In conclusion, by Equation (8), setting , we get

 E(|ˆfn,hn(x)−f(x)|2)⩽cn2⎛⎜⎝∣∣ ∣∣n∑k=1hβkl!∣∣ ∣∣2+n∑k=11hk⎞⎟⎠.□

### 6.2. Proof of Proposition 2.5

In order to prove Proposition 2.5, the two following generalizations of Minkowski’s inequality are required.

###### Lemma 6.1.

For any Borel function , if is integrable and

 y⟼∫∞−∞φ(y,z)dz

is a Borel function, then

1. .

2. .

###### Proof.

Result (1) is proved in Tsybakov [13], see Lemma A.1 for a proof.

The proof of Lemma 6.1.(2) is nearly the same, but we provide it for the sake of completeness. Assume that

 Mφ:=n∑k=1(∫∞−∞φ(k,z)2dz)1/2<∞

and consider

 S(z)=n∑k=1φ(k,z) ; ∀z∈R.

For every ,

 ∣∣∣∫∞−∞f(z)S(z)dz∣∣∣ ⩽ ∫∞−∞|f(z)|n∑k=1|φ(k,z)|dz=n∑k=1∫∞−∞|f(z)|⋅|φ(k,z)|dz ⩽ ∥f∥2n∑k=1