# Kernel Density Estimation Bias under Minimal Assumptions

Kernel Density Estimation is a very popular technique of approximating a density function from samples. The accuracy is generally well-understood and depends, roughly speaking, on the kernel decay and local smoothness of the true density. However concrete statements in the literature are often invoked in very specific settings (simplified or overly conservative assumptions) or miss important but subtle points (e.g. it is common to heuristically apply Taylor's expansion globally without referring to compactness). The contribution of this paper is twofold (a) we demonstrate that, when the bandwidth is an arbitrary invertible matrix going to zero, it is necessary to keep a certain balance between the kernel decay and magnitudes of bandwidth eigenvalues; in fact, without the sufficient decay the estimates may not be even bounded (b) we give a rigorous derivation of bounds with explicit constants for the bias, under possibly minimal assumptions. This connects the kernel decay, bandwidth norm, bandwidth determinant and density smoothness. It has been folklore that the issue with Taylor's formula can be fixed with more complicated assumptions on the density (for example p. 95 of "Kernel Smoothing" by Wand and Jones); we show that this is actually not necessary and can be handled by the kernel decay alone.

## Authors

• 12 publications
05/24/2017

### Consistent Kernel Density Estimation with Non-Vanishing Bandwidth

Consistency of the kernel density estimator requires that the kernel ban...
03/25/2019

### β-Divergence loss for the kernel density estimation with bias reduced

Allthough nonparametric kernel density estimation with bias reduce is no...
04/26/2021

### Data-Based Optimal Bandwidth for Kernel Density Estimation of Statistical Samples

It is a common practice to evaluate probability density function or matt...
10/25/2018

### Adaptive Density Estimation on Bounded Domains

We study the estimation, in Lp-norm, of density functions defined on [0,...
02/14/2011

### Dual-Tree Fast Gauss Transforms

Kernel density estimation (KDE) is a popular statistical technique for e...
06/27/2012

### Faster Gaussian Summation: Theory and Experiment

We provide faster algorithms for the problem of Gaussian summation, whic...
04/03/2015

### The Gram-Charlier A Series based Extended Rule-of-Thumb for Bandwidth Selection in Univariate and Multivariate Kernel Density Estimations

The article derives a novel Gram-Charlier A (GCA) Series based Extended ...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

### 1.1 Kernel Density Estimation

#### Density estimation by convolutions

Density estimation is the fundamental problem of approximating a probability density function

given a set of iid samples . The popular approach, called Kernel Density Estimation, uses a convolution of a suitable filter (called kernel) with the sample distribution . Formally, the KDE estimator is defined by

 ^f(x′)=K⋆fD(x′)=∫Kh(x′−x)fD(x)dx (1)

and in this form is credited to Rosenblatt and Parzen [Ros56, Par62]. Usually one uses rescaled versions of a base kernel

 Kh(u)=|h|−1⋅K(h−1u) (2)

where the scale parameter is a invertible matrix called bandwidth and is the matrix determinant (for simplicity one often considers diagonal ). Under certain assumptions on the kernel (rapid decay, moments) and the density (smoothness), the KDE estimator is consistent asymptotically, that is when . Intuitively, the convergence follows as for close to we have by the smoothness of , and for larger the possible bias is penalized by the scaled kernel as is big for small . Specific bounds depends on the kernel and local smoothnes of .

#### Estimator Accuracy

The variance of the estimator is quite easy to compute

 Var(^f(x′))=|D|−1⋅Varx∼f[Kh(x′−x)] (3)

and (under some assumptions on ) is of order with the hidden constant dependent on . In turn, bias is obtained by exchanging expectation and the convolution integral

 bias(^f(x′))=ED[K⋆fD(x′)−f(x′)]=Kh⋆f(x′)−f(x′) (4)

Intuitively, it captures by how much the convolution perturbs the density; this in turn depends on how the kernel interacts with the local series expansion of . Expanding around and parametrizing one obtains a series where

 μk(x′,h)=⎧⎨⎩f(x′)⋅(∫K(u)−1)duk=0(−1)kk!∫K(u)Dkf(x′)((hu)(k))duk=1,2,… (5)

the -th derivative is understood as a -linear map from to and

denotes the vector

stacked -times; here one needs to some assumptions on the kernel and derivatives to guarantee that the integrals exist.

In general , so one designs the filter to eliminate low-order terms:

1. (unit mass) when , the bias is of order

2. (symmetry) if in addition , the bias improves to 111 is a weighted sum of terms , which are zero when is symmetric..

The best, over the choice of , MSE error equals then (pointwise, for fixed )

 MSE=O(n−44+d),∥h∥=Θ(n−14+d) (6)

This improves upon histograms (they have error ). Cacoullos [Cac64] gives a rigorous derivation of the bias and variance formulas for diagonal .

#### Better accuracy with higher-order kernels

One can farther reduce bias by eliminating more of the expansion terms. Such kernels are also called higher-order kernels and compensate the negative impact of dimension

on the variance (curse of dimensionality). If the property

holds for one says the kernel is of order ; the bias is of order which (for the optimized bandwith) gives the mse error of order  [EH09]. Higher-order kernels can be built as products of single-dimension higher-order kernels; the problem of developing one-dimensional filters from Taylor expansions was studied in [MMMY97].

### 1.2 Contribution of this paper

The fundamental properties of kernel estimators, including bias and variance, are generally well understood. However the concrete statements in the existing literature are based on various assumptions; sometimes they are overly simplistic, sometimes too conservative, and finally sometimes important assumptions are ignored. We mention few prominent examples, to be specific:

• Bandwidth is scalar or diagonal [Cac64], or is given by rescaling a fixed matrix [Jia17, Cac64]

• For second-order kernels, smoothness of the density of order or higher is assumed [ZD13, Cac64]

• Taylor’s expansion is used globaly which suggest that the kernel decay is not needed [YC18, EH09] without referring to compactness or taking compact arguments to hold globally [DUO05]. In fact without sufficient decay the estimates are not even bounded (we will discuss a general example).

The purpose of this paper is to give a rigorous bounds on the bias, under minimal constraints on the bandwidth matrix and the kernel decay. Particularly, we discuss what happens when the bandwidth elements goes to zero at different rates.

## 2 Results

### 2.1 Necessary kernel decay and bandwidth eigenvalues balance

The -th moment of the kernel is defined as . The following construction shows that, to reconstruct the density from its behavior in a fixed neighborhood, the kernel decay and discrepancy of eigenvalues of must be balanced. This is true regardless of smoothness and moments of (note that bounded moments do not imply decay!).

###### Theorem 2.1 (Lower bound on bias in terms of kernel decay and bandwidth eigenvalues)

For any there exists a radial kernel on which is infinitely differentiable, has finite first moments, and decay rate at infinity not faster than with the following property: over the class of densities with given behavior on the unit ball

 F={f:f(u)=f0(u)∀u:|u|⩽1}, for any f0: ∫|u|⩽1f0(u)du<1

the density estimation at is lower-bounded by

 maxf∈F[Kh⋆f(0)]=Ω(1)⋅|λ1|p/d∏i=1|λi| (7)

where are eigenvalues of ordered so that .

###### Proof

Consider a non-negative ”radial” kernel on where is a non-negative real function such that

 supr∈ZrpK(r)=Ω(1) (8)

for some fixed , the supremum being over integers. For example, let be the standard bump function. Now for some constant consider

 k(r)=c⋅∑n∈Z,|n|⩾2|n|−p⋅Ψ(2|n|p+ℓ+d+1r−n)

the sum of shifted and rescaled bump functions - the -th is component centered at with the interval width and the spike of magnitude . Clearly is analytic because each point is covered by finitely many smooth components (actually by at most one) Moreover, has integrable moments up to order

 ∫|rjk(r)|dr⩽∑n∈Z,|n|⩾2np+j⋅n−p−d−ℓ−1<∞,j=0,…,d+ℓ−1.

It is well-known that for radial functions it holds (by the spherical parametrization). Therefore defined from our is indeed integrable and has all moments up to ; by manipulating we can normalize the integral to . Note also that is infinitely differentiable in , also at because in the neighborhood of zero by definition. Now, since and are positive

 Kh⋆f(0)=|h|−1∫K(−h−1u)f(u)du=Ω(1)⋅|h|−1∫|u|>1|h−1u|−pf(u)du

The class represents all functions with same behavior on the unit ball as the function . The maximum of the expression above over this class equals

 maxf∈F |h|−1∫|u|>1|h−1u|−pf(u)du=|h|−1sup|u|>1|h−1u|−p

Note that the supremum is achieved on the boundary (consider scaling by a scalar ). We have then for some

 Kh⋆f′(0)=Ω(1)|h|−1sup|u|=1|h−1u|−p

This is equivalent to

 Kh⋆f′(0)=Ω(1)⋅|h|−1(inf|u|=1|h−1u|)−p (9)

We can use the max norm because of equivalence of all vector norms. Let where be the eigenvalues of . Then

are eigenvectors of

. Let be the vector such that ; it follows that ; since one obtains

 Kh⋆f′(0)=Ω(1)⋅|λ1|pd∏i=1|λi|−1

this finishes the proof.

From Equation 7 it is clear that when one needs not only to decay at least as fast as the negative power of (with and the estimate is unbounded) but also to keep some balance between the bandwidth eigenvalues. We note that in  [Cac64], for the simpler case of product kernels and diagonal bandwidth, one assumes that where is a positive diagonal matrix and ; this implies that eigenvalues are of comparable order.

###### Remark 1

Note that the kernel in this argument is non-compact, but has moments up to an arbitrary fixed order.

### 2.2 Multivariate KDE bias under general bandwidths

We give a fairly general bounds on the bias below. Note that formulas often cited in the literature, such as p. 95 in [WJ94] are limited to compact . The authors suggest that fixing this can be done at the cost of assuming more on the density 222p. 95 in [WJ94] in : ”the assumptions of the compact support of can be removed by imposing more complicated conditions on We show that that extra conditions on are actually not necessary, and kernels with non-compact support can be handled by the decay.

###### Theorem 2.2 (General bias formula)

Let be a -th order kernel with bounded moments up to . Suppose that has -th derivatives bounded in a -neighborhood of . Then the remainder in the bias expansion equals

 bias(^f(x′))−k∑i=0μk(x′,h)=Rk(x′,h)⋅∥h∥2

where are defined in Equation 5 and for any

 |Rk(x′,h)|⩽2|h|−1sup|u|>δ/∥h∥K(u)+C⋅μK(k)⋅1k!sup|u|⩽δ∥Dkf(x′+u)∥

where the constant depends only on the chosen norm.

###### Corollary 1 (Bias under k-th order kernels)

If , are as in Theorem 2.2, and

1. decays at infinity faster than the negative power of

then the remainder is .

###### Remark 2 (Balance of eigenvalues)

Note that can be easily unbounded (consider diagonal matrix with different entries). In the opposite direction by Hadamard’s Inequality [Lan14] we have that is bounded. If are eigenvalues of then and ; thus implies that all eigenvalues are of same magnitude.

###### Remark 3

One can allow for larger discrepancy between and with faster decay of the kernel.

In the proof we will use the multivariate Taylor formula with the integral remainder form. To get terms up to the -th order we assume that -th derivatives exist and are locally bounded. It might be possible to further weakened the assumptions, e.g. to ue the Taylor formula when -th derivatives are absolutely continuous [AD01].

###### Lemma 1 (Multivariate Taylor’s Formula [Con06, Ad01])

Let be a compact convex set in and let have absolutely continuous -th derivatives. Then for any and such that

where

 Rx(h)=∫10(1−t)k−1(k−1)!(Dkg(x+t⋅h)−Dkg(x))(h(k))dt.
###### Proof (of Theorem)

We split the convolution integral

 Kh⋆f(x′)−f(x′)=∫K(u)(f(x′−hu)−f(x′))du (10)

integral in two regions: and where will depend on . The general strategy is as follows: ”big” values of are handled by the decay of , whereas ”small” are worked out by the smoothness of . Consider first ”big”

Let . By the properties of the matrix norm . Therefore in the region of integration and since is decreasing. Since , we obtain

 I1⩽2|h|−1ψ(δ∥h∥−1|) (11)

Consider now the case of ”small” values of . We assume that so that we can apply the Taylor formula. The main terms are as in Equation 5 and are well defined provided that is absolutely integrable and that exists at . It suffices to consider the remainder. Let

 I2=|h|−1∫|u|⩽δK(h−1u)∣∣Dkf(x′−tu)(u)(k)∣∣du (12)

for any fixed . If we bound this integral uniformly in , let’s say then according to Lemma 1 we will get . Let’s change variables . We have

 I2=t−k−d|h|−1∫|v|⩽δtK(t−1h−1v)∣∣Dkf(x′−v)(v)(k)∣∣dv

By the properties of multilinear maps

 ∣∣Dkf(x′−v)(v)(k)∣∣=O(1)∥∥Dkf(x′−v)∥∥v|k

where the constant depends only on the chosen norms. We obtain

 I2⩽O(1)⋅t−k−d|h|−1∫|v|⩽δt|v|kK(t−1h−1v)∥∥Dkf(x′−v)∥∥dv

Now if for , we obtain

 I2 ⩽O(1)⋅B(δ)⋅∫|hu|⩽δ|hu|kK(u)du ⩽O(1)⋅B(δ)⋅∥h∥k∫|hu|⩽δ|u|kK(u)du

where we changed variables and used the norm inequality . Since is integrable, we obtain

 I2⩽O(1)⋅B(δ)⋅μK(k)⋅∥h∥k (13)

The result follows by combining Equation 11 and Equation 13.

## References

• [AD01] G.A. Anastassiou and S.S. Dragomir, On some estimates of the remainder in taylor’s formula, Journal of Mathematical Analysis and Applications 263 (2001), no. 1, 246 – 263.
• [Cac64] Theophilos Cacoullos, Estimation of a multivariate density, https://www.ism.ac.jp/editsec/aism/pdf/018_2_0179.pdf, 1964.
• [Con06] Brian Conrad, Higher derivatives and taylor’s formula via multilinear maps, http://math.stanford.edu/~conrad/diffgeomPage/handouts/taylor.pdf, 2006.
• [DUO05] Convergence rates for unconstrained bandwidth matrix selectors in multivariate kernel density estimation

, Journal of Multivariate Analysis

93 (2005), no. 2, 417 – 433.
• [EH09] Bruce E. Hansen, https://www.ssc.wisc.edu/~bhansen/718/NonParametrics1.pdf, 2009.
• [Jia17] Heinrich Jiang, Uniform convergence rates for kernel density estimation

, Proceedings of the 34th International Conference on Machine Learning, vol. 70, PMLR, 2017, pp. 1694–1703.

• [Lan14] Kenneth Lange, Hadamard’s determinant inequality, The American Mathematical Monthly 121 (2014), no. 3, 258–259.
• [MMMY97] Torsten Möller, Raghu Machiraju, Klaus Mueller, and Roni Yagel, Evaluation and design of filters using a taylor series expansion, IEEE Transactions on Visualization and Computer Graphics 3 (1997), no. 2, 184–199.
• [Par62] Emanuel Parzen, On estimation of a probability density function and mode, Ann. Math. Statist. 33 (1962), no. 3, 1065–1076.
• [Ros56] Murray Rosenblatt, Remarks on Some Nonparametric Estimates of a Density Function, The Annals of Mathematical Statistics 27 (1956), no. 3, 832–837.
• [WJ94] M.P. Wand and M.C. Jones, Kernel smoothing, Chapman & Hall/CRC Monographs on Statistics & Applied Probability, Taylor & Francis, 1994.
• [YC18] Chen Yen-Chi, Introduction to nonparametric statistics - lecture notes, http://faculty.washington.edu/yenchic/18W_425/Lec6_hist_KDE.pdf, 2018.
• [ZD13] Adriano Z. Zambom and Ronaldo Dias, A Review of Kernel Density Estimation with Applications to Econometrics, International Econometric Review (IER) 5 (2013), no. 1, 20–42.