# Kernel Density Estimation by Stagewise Algorithm with a Simple Dictionary

This study proposes multivariate kernel density estimation by a stagewise minimization algorithm based on U-divergence and a simple dictionary. The dictionary consists of an appropriate scalar bandwidth matrix and a part of the original data. The resulting estimator provides data-adaptive weighting parameters and bandwidth matrices, and realizes a sparse representation of kernel density estimation. We develop the non-asymptotic error bound of the estimator obtained via the proposed stagewise minimization algorithm. Simulation studies confirm that the proposed estimator performs competitively with, or sometimes better than, other well-known density estimators.

03/03/2022


## Acknowledgments

The second author gratefully acknowledges the financial support from KAKENHI 19K11851.

## 1 Introduction

Let $X_1, X_2, \ldots, X_N$ be a $d$-dimensional i.i.d. sample generated from the true density function $f$ on $\mathbb{R}^d$. The general representation of the multivariate Kernel Density Estimator (KDE) is written as

$$\hat{f}_H(x) = \sum_{i=1}^{N} \alpha_i K_{H_i}(x - X_i), \qquad (1)$$

where $H_i$, $i = 1, \ldots, N$, is a symmetric and positive definite $d \times d$ bandwidth matrix used for the data point $X_i$, $K$ is a non-negative real-valued bounded kernel function, and $\alpha_i$, $i = 1, \ldots, N$, are the weighting parameters assigned to the data points $X_i$.
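As a point of reference, the general form (1) can be sketched numerically. The following is a minimal illustration assuming Gaussian kernels and scalar bandwidth matrices $H_i = h_i^2 I_d$; the function names are ours, not from the paper:

```python
import numpy as np

def gaussian_kernel(u, h):
    """Isotropic Gaussian kernel with bandwidth matrix H = h^2 I_d."""
    d = u.shape[-1]
    norm = (2.0 * np.pi) ** (d / 2) * h ** d
    return np.exp(-0.5 * np.sum(u ** 2, axis=-1) / h ** 2) / norm

def weighted_kde(x, data, alphas, hs):
    """Evaluate (1): sum_i alpha_i K_{H_i}(x - X_i), with H_i = h_i^2 I_d."""
    return sum(a * gaussian_kernel(x - xi, h)
               for a, xi, h in zip(alphas, data, hs))
```

With $\alpha_i = 1/N$ and a common $h$, this reduces to the classical KDE; non-uniform weights and per-point bandwidths give the general form (1).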

One approach to implementing (1) is to set $\alpha_i = 1/N$ and estimate a common bandwidth matrix $H_i = H$ efficiently. This approach emphasizes finding an efficient full bandwidth matrix rather than placing simple assumptions on the weighting parameters. Duong and Hazelton (2003) propose the Direct Plug-In (DPI) bandwidth matrix in the setting of a bivariate full bandwidth matrix; in the simulations below, the estimator using the DPI full bandwidth matrix is denoted KDE1.

Another approach is the Reduced Set Density Estimator (RSDE) of Girolami and He (2003), which first employs the scalar bandwidth matrix $H = h^2 I_d$, where $I_d$ is the $d$-dimensional identity matrix and the constant $h$ is determined by cross-validation. Second, the parameters $\alpha_i$, $i = 1, \ldots, N$, are estimated to minimize the Integrated Squared Error (ISE) under the constraints $\alpha_i \ge 0$, $i = 1, \ldots, N$, and $\sum_{i=1}^{N} \alpha_i = 1$. RSDE imposes simple assumptions on the bandwidth matrix but requires more effort in calculating the weighting parameters. RSDE also allows $\alpha_i = 0$ for some $i$'s, realizing a sparse representation of kernel density estimation because those data points are not used in the estimation. We denote the resulting estimator simply by RSDE.
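The ISE-minimization step behind RSDE can be illustrated as a quadratic program over the probability simplex. The sketch below assumes a Gaussian kernel (so the cross-product integrals $\int K_h(x - X_i)K_h(x - X_j)\,dx$ have a closed form) and uses a generic SciPy solver rather than Girolami and He's actual updating scheme; the names are ours and it illustrates the criterion, not their implementation:

```python
import numpy as np
from scipy.optimize import minimize

def gauss_from_sq(sq, h, d):
    """Gaussian kernel value given squared distances, bandwidth h^2 I_d."""
    return np.exp(-0.5 * sq / h ** 2) / ((2.0 * np.pi) ** (d / 2) * h ** d)

def rsde_weights(X, h):
    """Minimize the ISE surrogate a'Qa - 2c'a over the probability simplex."""
    N, d = X.shape
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Q = gauss_from_sq(sq, np.sqrt(2.0) * h, d)  # closed-form cross integrals
    c = gauss_from_sq(sq, h, d).mean(axis=1)    # plug-in estimate of ∫K_h(x-Xi)f dx
    res = minimize(lambda a: a @ Q @ a - 2.0 * c @ a,
                   np.full(N, 1.0 / N), method="SLSQP",
                   jac=lambda a: 2.0 * (Q @ a - c),
                   bounds=[(0.0, 1.0)] * N,
                   constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}])
    return res.x
```

Many of the returned weights are typically driven to (numerically) zero, which is the source of the sparse representation noted above.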

Beyond these approaches, algorithm-based methods have also been developed, such as projection pursuit density estimation (Friedman et al. 1984) and boosting (Ridgeway 2002). In relation to boosting, Klemelä (2007) developed density estimation by a stagewise algorithm together with its non-asymptotic error bounds. Naito and Eguchi (2013) developed the stagewise algorithm under the setting of $U$-divergence. The stagewise algorithm requires a dictionary prepared beforehand, whose words are density functions; it starts by choosing from the dictionary the density function that minimizes the empirical loss, and proceeds in a stagewise manner, adding new simple functions to the convex combination.

In this paper, we apply the stagewise algorithm of Klemelä (2007) and Naito and Eguchi (2013) to the kernel density estimator in (1). We randomly split an i.i.d. sample into two disjoint sets, one used for the means of the kernel functions in the dictionary and the other for calculating the criterion function, and implement the stagewise algorithm. The outcome is expressed in the form of (1) and yields data-adaptive weighting parameters $\alpha_i$, while virtually realizing a data-adaptive bandwidth matrix through the variation of the bandwidths in the dictionary. It also identifies the data points of no use for the estimation, yielding a sparse representation of kernel density estimation just like RSDE. We are especially interested in ascertaining whether or not our estimators can outperform their competitors, KDE and RSDE, in terms of estimation error and the degree of data condensation, while keeping the dictionary as simple as possible in terms of its bandwidth matrix structure.

The remainder of this paper is organized as follows. In Section 2, we introduce the evaluation criterion for our proposed method, $U$-divergence. Section 3 describes the proposed method. Section 4 presents the theoretical results for our estimator, namely its non-asymptotic error bound. We show the simulation results and a real data example in Section 5. The discussion and conclusions are presented in Section 6. In Appendices A and B, we provide the proofs of the theorems on the non-asymptotic error bounds of the proposed estimator and its normalized version in Section 4, respectively. In Appendices C and D, we show details of the related results in Section 4.

## 2 U-divergence

To compose the algorithm, we employ $U$-divergence, defined as the distance between the fixed density $f$ and any density function $g$, written as

$$D_U(f, g) = \int_{\mathbb{R}^d} \left[ U(\xi(g(x))) - U(\xi(f(x))) - f(x)\{\xi(g(x)) - \xi(f(x))\} \right] dx \ \ge\ 0, \qquad (2)$$

where $U$ is a strictly convex function on $\mathbb{R}$, $u(t) = U'(t)$, and $\xi = u^{-1}$. The equality in (2) holds if and only if $f = g$ except on a set of measure zero. The non-negativity of $D_U$ is explained by the convexity of $U$. The functional form of $U$-divergence is similar to that of the Bregman divergence (see Bregman 1967; Zhang et al. 2009, 2010).

Extracting the part relating to $g$ from (2), we obtain

$$L_U(g) = -\int_{\mathbb{R}^d} f(x)\,\xi(g(x))\,dx + \int_{\mathbb{R}^d} U(\xi(g(x)))\,dx. \qquad (3)$$

Replacing the first term on the right-hand side of (3) with its empirical counterpart, we obtain the empirical $U$-loss function

$$\hat{L}_U(g) = -\frac{1}{N}\sum_{i=1}^{N} \xi(g(X_i)) + \int_{\mathbb{R}^d} U(\xi(g(x)))\,dx. \qquad (4)$$

Minimizing (4) with respect to $g$ is equivalent to minimizing the empirical form of (2) for the fixed $f$.

If we specify the convex function $U$ to be the following $\beta$-power function with a tuning parameter $\beta$:

$$U_\beta(t) = \frac{1}{\beta + 1}(1 + \beta t)^{\frac{\beta + 1}{\beta}}, \qquad 0 < \beta \le 1,$$

we obtain the resulting divergence function

$$D_\beta(f, g) = \frac{1}{\beta + 1}\int_{\mathbb{R}^d} \{g(x)^{\beta+1} - f(x)^{\beta+1}\}\,dx - \frac{1}{\beta}\int_{\mathbb{R}^d} f(x)\{g(x)^{\beta} - f(x)^{\beta}\}\,dx, \qquad (5)$$

which is called the $\beta$-power divergence (see Basu et al. 1998; Minami and Eguchi 2002). We note that the limit of $D_\beta$ as $\beta$ goes to zero is the Kullback-Leibler (KL) divergence, because $\lim_{\beta \to 0}(t^{\beta} - 1)/\beta = \log t$. Alternatively, when $\beta = 1$, $D_\beta$ reduces to the squared $L_2$ distance $\frac{1}{2}\int_{\mathbb{R}^d}\{f(x) - g(x)\}^2\,dx$. We also note that the $\beta$-power divergence with $\beta > 0$ exhibits a robustness property, judging from the functional form of (5); employing $U$-divergence thus enables us to treat a variety of density estimators within a single framework.
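These limiting relations are easy to check numerically. The sketch below evaluates (5) on a grid for two unit-variance Gaussians with means $0$ and $0.5$, for which $KL(f, g) = \mu^2/2 = 0.125$; small $\beta$ reproduces the KL value, and $\beta = 1$ gives the non-negative squared $L_2$ case (the grid-integration code is ours):

```python
import numpy as np

x = np.linspace(-12.0, 12.0, 48001)
dx = x[1] - x[0]

def norm_pdf(x, mu=0.0, sd=1.0):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (np.sqrt(2.0 * np.pi) * sd)

f = norm_pdf(x)          # N(0, 1)
g = norm_pdf(x, mu=0.5)  # N(0.5, 1)

def d_beta(f, g, beta):
    """beta-power divergence (5), approximated by a Riemann sum on the grid."""
    t1 = np.sum(g ** (beta + 1) - f ** (beta + 1)) * dx / (beta + 1)
    t2 = np.sum(f * (g ** beta - f ** beta)) * dx / beta
    return t1 - t2

# Closed form: KL(N(0,1), N(0.5,1)) = 0.5^2 / 2 = 0.125
kl = np.sum(f * np.log(f / g)) * dx
```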

## 3 The method

Suppose that we have an i.i.d. sample $X_1, \ldots, X_N$ generated from $f$. From this sample, we take $X^*_i$, $i = 1, \ldots, m$, and use them for the dictionary. For the rest of the sample, we define $X^+_i$, $i = 1, \ldots, n$, with $n = N - m$, and use them in the algorithm to calculate the empirical loss. Let $B$ be a set of $d$-dimensional scalar bandwidth matrices $h_j^2 I_d$, $j = 1, \ldots, |B|$,

$$B = \left\{ h_j^2 I_d \ \middle|\ j = 1, 2, \ldots, |B| \right\}.$$

Each element of $B$ is predetermined by the user before starting the algorithm. Then, we define the dictionary

$$D = \left\{ \phi_{h_j}(\cdot - X^*_i) \ \middle|\ h_j^2 I_d \in B,\ i = 1, 2, \ldots, m,\ j = 1, 2, \ldots, |B| \right\}, \qquad |D| = m \times |B|, \qquad (6)$$

where $\phi_{h_j}(\cdot - X^*_i)$ is a density function with mean $X^*_i$ and variance-covariance matrix $h_j^2 I_d$. Each word in $D$ is denoted by $\phi_s$, where each index number $s$ corresponds one-to-one to a combination $(i, j)$, $i = 1, \ldots, m$, $j = 1, \ldots, |B|$.

#### Stagewise minimization algorithm

Let $M$ ($\ge 1$) be the number of iterations of the algorithm. Let $\epsilon$ be the approximation bound. We employ the mixing coefficients

$$\pi_0 = 1 \quad \text{and} \quad 0 < \pi_k = \frac{\theta}{k + \theta} < 1, \quad k = 1, \ldots, M - 1, \qquad \text{with } \theta \ge 2. \qquad (7)$$
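The weights $q_l$ that these mixing coefficients induce in the final estimator (8) sum to one, which can be checked directly; a small sketch with our own function name:

```python
import numpy as np

def mixing_weights(M, theta=2.0):
    """Weights q_l = pi_l * prod_{t>l} (1 - pi_t) induced by
    pi_0 = 1 and pi_k = theta / (k + theta), k = 1, ..., M-1."""
    pi = np.array([1.0] + [theta / (k + theta) for k in range(1, M)])
    return np.array([pi[l] * np.prod(1.0 - pi[l + 1:]) for l in range(M)])
```

Because $\pi_0 = 1$, the product telescopes and $\sum_{l=0}^{M-1} q_l = 1$ for any $M$ and $\theta$.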

From (4), the empirical loss is calculated by

$$\hat{L}_U(g(\cdot|X^*)) = -\frac{1}{n}\sum_{i=1}^{n} \xi(g(X^+_i|X^*)) + \int_{\mathbb{R}^d} U(\xi(g(x|X^*)))\,dx,$$

where $g(\cdot|X^*)$ is a function on $\mathbb{R}^d$ given $X^* = (X^*_1, \ldots, X^*_m)$. Then, the algorithm for the stagewise minimization estimator consists of the following steps:

Step 1. At the initial stage $k = 0$, choose $\tilde{f}_0(\cdot|X^*) \in D$ so that

$$\hat{L}_U(\tilde{f}_0(\cdot|X^*)) \ \le\ \inf_{\phi \in D} \hat{L}_U(\phi(\cdot|X^*)) + \epsilon.$$

Step 2. For $k = 1, \ldots, M - 1$, let

$$\tilde{f}_k(x|X^*) = u\big((1 - \pi_k)\,\xi(\tilde{f}_{k-1}(x|X^*)) + \pi_k\,\xi(\tilde{\phi}(x|X^*))\big),$$

where $\tilde{\phi} \in D$ is chosen so that

$$\hat{L}_U(\tilde{f}_k(\cdot|X^*)) \ \le\ \inf_{\phi \in D} \hat{L}_U\big(u((1 - \pi_k)\,\xi(\tilde{f}_{k-1}(\cdot|X^*)) + \pi_k\,\xi(\phi(\cdot|X^*)))\big) + \pi_k \epsilon.$$

Step 3. Set $k = k + 1$ and return to Step 2 until $k = M - 1$.

At the final step $M - 1$, we obtain the sequence of the words chosen at each stage, $\tilde{\phi}_0, \tilde{\phi}_1, \ldots, \tilde{\phi}_{M-1}$, and the density estimator produced by the algorithm has the form

$$\hat{f}(x|X^*) = u\left(\sum_{l=0}^{M-1} q_l\,\xi(\tilde{\phi}_l(x|X^*))\right), \qquad q_l = \pi_l \prod_{t=l+1}^{M-1}(1 - \pi_t). \qquad (8)$$

We can verify that $\sum_{l=0}^{M-1} q_l = 1$. When we employ the $\beta$-power divergence with $\beta = 1$, the estimator (8) is rewritten as

$$\hat{f}(x|X^*) = \sum_{l=0}^{M-1} q_l\,\tilde{\phi}_l(x|X^*). \qquad (9)$$

Since (9) is a convex combination of the words $\tilde{\phi}_l$, the KDE is a special case of an estimator of the form (8).
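To make the steps concrete, the following toy sketch runs the algorithm for the $\beta = 1$ case, where the empirical loss reduces (up to an additive constant) to $\frac{1}{2}\int g^2 - \frac{1}{n}\sum_i g(X^+_i)$ and $\int g^2$ has a closed form for Gaussian words. All names are ours, and we take the approximation bound $\epsilon = 0$, so each stage picks the exact minimizer over the dictionary:

```python
import numpy as np

def phi(x, mu, h):
    """Gaussian word phi_h(x - mu) with covariance h^2 I_d."""
    d = x.shape[-1]
    return (np.exp(-0.5 * np.sum((x - mu) ** 2, axis=-1) / h ** 2)
            / ((2.0 * np.pi) ** (d / 2) * h ** d))

def cross(mu1, h1, mu2, h2, d):
    """Closed form of the integral of phi_{h1}(x-mu1) * phi_{h2}(x-mu2)."""
    s2 = h1 ** 2 + h2 ** 2
    return (np.exp(-0.5 * np.sum((mu1 - mu2) ** 2) / s2)
            / ((2.0 * np.pi) ** (d / 2) * s2 ** (d / 2)))

def stagewise_l2(X_star, X_plus, hs, M=10, theta=2.0):
    """Stagewise minimization for beta = 1: each stage shrinks the current
    mixture by (1 - pi_k) and adds the best dictionary word with weight pi_k."""
    d = X_star.shape[1]
    dictionary = [(mu, h) for h in hs for mu in X_star]

    def loss(comp):  # 0.5 * ∫g^2 - (1/n) * sum_i g(X+_i), up to a constant
        fit = sum(w * phi(X_plus, mu, h) for w, mu, h in comp).mean()
        quad = sum(wi * wj * cross(mi, hi, mj, hj, d)
                   for wi, mi, hi in comp for wj, mj, hj in comp)
        return 0.5 * quad - fit

    comp = [(1.0, *min(dictionary, key=lambda w: loss([(1.0, *w)])))]
    for k in range(1, M):
        pi_k = theta / (k + theta)
        shrunk = [(w * (1.0 - pi_k), mu, h) for w, mu, h in comp]
        best = min(dictionary, key=lambda w: loss(shrunk + [(pi_k, *w)]))
        comp = shrunk + [(pi_k, *best)]
    return comp  # list of (weight, mean, h); weights sum to 1
```

The returned components are exactly the convex combination (9): the weights are the $q_l$ of (8), and only the dictionary points actually chosen appear in the estimator, which is the source of the data condensation discussed later.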

###### Remark 1

The integral of the estimator $\hat{f}(\cdot|X^*)$ is not always 1. Hence, we may consider its normalized form

$$\hat{f}_c(x|X^*) = \gamma^{-1}\hat{f}(x|X^*), \qquad \gamma = \gamma(X^*) = \int_{\mathbb{R}^d} \hat{f}(x|X^*)\,dx.$$
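When no closed form for $\gamma$ is available, it can be approximated by numerical integration; a minimal two-dimensional sketch (the helper is ours, not from the paper):

```python
import numpy as np

def normalize_2d(fhat, lim=8.0, n=401):
    """Approximate gamma = ∫ fhat(x) dx over [-lim, lim]^2 on a grid,
    and return (gamma, normalized estimator) as in Remark 1."""
    ax = np.linspace(-lim, lim, n)
    dx = ax[1] - ax[0]
    xx, yy = np.meshgrid(ax, ax, indexing="ij")
    pts = np.stack([xx, yy], axis=-1)   # shape (n, n, 2)
    gamma = fhat(pts).sum() * dx * dx   # Riemann approximation of the integral
    return gamma, (lambda x: fhat(x) / gamma)
```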
###### Remark 2

The proportion $m/N$ of the dictionary data points in the total sample size influences the performance of the density estimation, and the parameter $m$ serves as a kind of smoothing parameter. Setting aside the problem of optimizing $m$, we assume that $m$ is given before starting the algorithm.

## 4 Theoretical results

We show the theoretical results for our proposed estimator. The main result is the non-asymptotic error bound of the estimator in Theorem 1. We also show the non-asymptotic error bound of the normalized version of the proposed estimator in Theorem 2.

In the theorems, we use the following notation. Let $co(\xi(D))$ be the convex hull composed of $\xi(D)$,

$$co(\xi(D)) = \left\{ \sum_{i=1}^{|D|} \lambda_i\,\xi(\phi_i(\cdot|X^*)) \ \middle|\ \phi_i(\cdot|X^*) \in D,\ \sum_{i=1}^{|D|} \lambda_i = 1,\ \lambda_i \ge 0 \right\}.$$

We consider a triplet

$$\Phi = \left( \sum_{m=1}^{T} q_m\,\xi(\tilde{\phi}_m),\ \phi,\ \bar{\phi} \right), \qquad \sum_{m=1}^{T} q_m\,\xi(\tilde{\phi}_m) \in co(\xi(D)), \quad \phi, \bar{\phi} \in D.$$

The set of these triplets is denoted by

$$H(D) \equiv co(\xi(D)) \times D \times D = \left\{ \Phi \ \middle|\ \sum_{m=1}^{T} q_m\,\xi(\tilde{\phi}_m) \in co(\xi(D)),\ \phi, \bar{\phi} \in D \right\}.$$

For $\delta \in [0, 1]$ and $\Phi \in H(D)$, we define

$$\psi_U(\delta, \Phi|X^*) = \int_{\mathbb{R}^d} U''\!\left( (1 - \delta)\sum_{m=1}^{T} q_m\,\xi(\tilde{\phi}_m(x|X^*)) + \delta\,\xi(\phi(x|X^*)) \right) \{\xi(\phi(x|X^*)) - \xi(\bar{\phi}(x|X^*))\}^2\,dx.$$

### 4.1 The non-asymptotic error bound of the estimator

To obtain the non-asymptotic error bound of the estimator, we use Assumption 1 as follows.

###### Assumption 1

(i) The convex function $U$ is twice differentiable.
(ii) There exists a constant $B_U(X^*)$ such that

$$\sup_{\delta \in [0,1]} \sup_{\Phi \in H(D)} \psi_U(\delta, \Phi|X^*) \ \le\ B_U(X^*)^2, \quad \text{almost surely.}$$

Example 1. (The case of KL divergence)
If we introduce a constant $B_{KL}(X^*)$ defined by

$$B_{KL}(X^*)^2 = \sup_{\phi, \bar{\phi}, \tilde{\phi} \in D} \int_{\mathbb{R}^d} \tilde{\phi}(x|X^*)\{\log \phi(x|X^*) - \log \bar{\phi}(x|X^*)\}^2\,dx,$$

in the case where the KL divergence is employed for evaluating the algorithm, we see that $\psi_U(\delta, \Phi|X^*) \le B_{KL}(X^*)^2$ for any $\delta \in [0, 1]$ and any $\Phi \in H(D)$. The proof is provided in Appendix C. We also evaluate the constant $B_{KL}(X^*)^2$ and derive its upper bound (18) in Appendix C, employing Gaussian densities with scalar bandwidth matrices in the dictionary. If we consider the upper bound (18) in the case of KL divergence, Assumption 1 is justified.

Then, we obtain Theorem 1. The proof is given in Appendix A.

###### Theorem 1

For the density estimator $\hat{f}(\cdot|X^*)$ in (8), it holds under Assumption 1 that

$$E_{X^*}E_{X^+}\left[ D_U(f(\cdot), \hat{f}(\cdot|X^*)) \right] \le E_{X^*}\left[ \inf_{g \in co(\xi(D))} D_U(f(\cdot), u(g(\cdot|X^*))) \right] + 2\,E_{X^*}\!\left[ E_{X^+}\!\left[ \sup_{\phi \in D} \left| \nu_n(\xi(\phi(\cdot|X^*))) \right| \right] \right] + \frac{\theta^2}{M + \theta - 1}\,E_{X^*}\left[ B_U(X^*)^2 \right] + \epsilon, \qquad (10)$$

where $\nu_n$ is the centered operator,

$$\nu_n(\xi(\phi(\cdot|X^*))) = \frac{1}{n}\sum_{i=1}^{n} \xi(\phi(X^+_i|X^*)) - \int_{\mathbb{R}^d} \xi(\phi(x|X^*))\,f(x)\,dx.$$

The symbol $E_{X^*}$ represents the expectation taken with respect to the sample used for the dictionary, $X^*$, whereas $E_{X^+}$ is that taken with respect to the sample used for the algorithm, $X^+$. The third term of the error bound in (10) diminishes as $M$ increases.

###### Remark 3

In the right-hand side of (10), the expected value $E_{X^*}[B_U(X^*)^2]$ appears. In the case of the KL divergence (Example 1), it suffices that the fourth moment of $X^*_i$ exists to ensure the finiteness of $E_{X^*}[B_{KL}(X^*)^2]$. See Appendix D for details.

### 4.2 The error bound of the normalized form of the estimator

To obtain the non-asymptotic error bound of the normalized form of the estimator, we use Assumption 2.

###### Assumption 2

There exist two constants such that the required inequality holds for any $x \in \mathbb{R}^d$ and for any $X^*$, where $\hat{f}_c(\cdot|X^*)$ is the normalized form of $\hat{f}(\cdot|X^*)$.

Then, we obtain Theorem 2. The proof is given in Appendix B.

###### Theorem 2

For the normalized form $\hat{f}_c(\cdot|X^*)$ of $\hat{f}(\cdot|X^*)$, it follows from Assumptions 1 and 2 that

$$E_{X^*}\left[ E_{X^+}\left[ D_U(f(\cdot), \hat{f}_c(\cdot|X^*)) \right] \right] \le E_{X^*}\left[ E_{X^+}\left[ D_U(f(\cdot), \hat{f}(\cdot|X^*)) \right] \right] + \cdots, \quad \text{where } v_{\hat{f}}(X^*) = \int_{\mathbb{R}^d} \hat{f}(x|X^*)\,dx.$$
###### Remark 4

Theorem 2 reveals that the error bound for the normalized estimator $\hat{f}_c(\cdot|X^*)$ corresponds to that for $\hat{f}(\cdot|X^*)$ given in Theorem 1, along with an extra term.

###### Remark 5

We obtain $v_{\hat{f}}(X^*) = 1$ when the $\beta$-power divergence with $\beta = 1$ is employed, since (9) is then a convex combination of density functions. In such a situation, the result of Theorem 2 coincides with that of Theorem 1.

## 5 Applications

### 5.1 Practical setting

For the sake of practical use, we consider dictionaries 1 and 2, denoted by $D_1$ and $D_2$, respectively. In dictionary 1, we use the following set of scalar bandwidth matrices:

$$B_1 = \left\{ h^2 I_2 \ \middle|\ h = \hat{h}\cdot\left(\frac{m}{j}\right)^{1/6},\ \hat{h} = \sqrt{\hat{h}_{DPI,11}\,\hat{h}_{DPI,22}},\ j = 1, 2, \ldots, 5 \right\}, \qquad (11)$$

where $\hat{h}_{DPI,11}$ and $\hat{h}_{DPI,22}$ are the DPI estimators in Duong and Hazelton (2003) of the diagonal entries of the bivariate diagonal bandwidth matrix, calculated from the dictionary data $X^*_i$, $i = 1, \ldots, m$. To obtain $\hat{h}_{DPI,11}$ and $\hat{h}_{DPI,22}$, we employ the Hpi.diag function in the 'ks' library in R. The bandwidth used for $D_1$ should be larger than $\hat{h}$, which is calculated from the full number of data points, because the resulting estimator entails a convex combination of not more than $M$ kernel functions. In this sense, each word in $D_1$ is augmented by multiplying $\hat{h}$ by the factor $(m/j)^{1/6}$.
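Under our reading of (11), the bandwidth grid can be generated as follows once $\hat{h}_{DPI,11}$ and $\hat{h}_{DPI,22}$ have been obtained (e.g., from Hpi.diag in R); the function and argument names are ours:

```python
import numpy as np

def dictionary1_bandwidths(h_dpi_11, h_dpi_22, m, n_levels=5):
    """Bandwidth grid for B_1 in (11): h_j = hhat * (m/j)^(1/6),
    with hhat = sqrt(h_DPI,11 * h_DPI,22)."""
    hhat = np.sqrt(h_dpi_11 * h_dpi_22)
    j = np.arange(1, n_levels + 1)
    return hhat * (m / j) ** (1.0 / 6.0)
```

Since $m/j > 1$ for every $j$ in the grid, each bandwidth in $B_1$ is an inflated version of $\hat{h}$, with the inflation decreasing in $j$.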

In dictionary 2, we consider the following set of scalar bandwidth matrices:

$$B_2 = \left\{ H_j = h_j^2 I_2 \ \middle|\ h_j = \{SD(X^*_1)\,SD(X^*_2)\}^{1/2}\cdot\left(\frac{2}{1 + \eta(j - 1)}\right)^{1/6},\ j = 1, 2, \ldots, 10 \right\}, \qquad (12)$$

where $SD(X^*_l)$ is the standard deviation of the $l$-th coordinate of the dictionary data, $l = 1, 2$. Parameter $\eta$ is a tuning parameter, determined according to the sample size and/or the curvature of the true function; we use a common value of $\eta$ in all settings except Type J, for which a different value is set. If we assume the parameter $\eta$ to be an increasing function of the sample size, then $h_j$ in (12) is similar to the geometric mean of the coordinatewise bandwidths of Scott's rule (Scott 2017, p.164).
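Under our reading of (12), the grid $B_2$ depends only on the coordinatewise standard deviations of the dictionary data and on $\eta$; a sketch (names and the default grid size are ours):

```python
import numpy as np

def dictionary2_bandwidths(X_star, eta, n_levels=10):
    """Bandwidth grid for B_2 in (12):
    h_j = sqrt(SD(X*_1) * SD(X*_2)) * (2 / (1 + eta*(j-1)))^(1/6)."""
    sd = X_star.std(axis=0, ddof=1)        # coordinatewise standard deviations
    scale = np.sqrt(sd[0] * sd[1])         # geometric mean of the two SDs
    j = np.arange(1, n_levels + 1)
    return scale * (2.0 / (1.0 + eta * (j - 1))) ** (1.0 / 6.0)
```

For $\eta > 0$ the grid is decreasing in $j$, so a larger $\eta$ pushes the dictionary toward smaller bandwidths, consistent with its role as a sample-size-dependent tuning parameter.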

### 5.2 Simulation

We consider simulations 1 and 2 for the dictionaries $D_1$ and $D_2$, respectively. In each simulation case, we examine the behavior of the proposed density estimator in terms of Mean Integrated Squared Error (MISE) when the proportion of the dictionary data points in the total sample size changes. We design the following five simulation cases for that purpose:

(a) , .

(b) , ; however , .

(c) , .

(d) , .

(e) , .

Cases (a), (b), and (c) examine the impact of the ratio of $m$ to $n$ on the behavior of the proposed density estimators. Cases (d) and (e) are designed for comparison. In case (d), half of the original i.i.d. sample $X_i$, $i = 1, \ldots, N$, is discarded, and the remainder is used for both the dictionary data $X^*$ and the algorithm data $X^+$. In case (e), the original i.i.d. sample $X_i$, $i = 1, \ldots, N$, is used in common for the dictionary and the algorithm.

In each simulation case, we use the bivariate simulation settings of Wand and Jones (1993), Types C, J, and L, whose contour plots are shown in Figure 1. For each simulation setting, we generate a sample of size $N$; we retain one part of it for the dictionary, use the remainder for calculating the empirical loss, and run the algorithm. We repeat this process 10 times and obtain the MISE by averaging the ISEs calculated in each repetition. We consider three alternatives to our estimator: KDE1 and KDE2, with Duong and Hazelton's (2003) DPI full bandwidth matrix and DPI diagonal bandwidth matrix, respectively, as well as RSDE. For the divergence function, we employ the $\beta$-power divergence function in (5) with two values of the tuning parameter $\beta$, and denote our estimators minimizing the $\beta$-power divergence correspondingly. For the parameter $\theta$ in the mixing coefficients in (7), we follow Klemelä (2007). The total number of iterations of the algorithm is $M$.

#### 5.2.1 Simulation 1

We present the numerical results of the two proposed estimators in Tables 1 and 2, respectively, in terms of MISE. The visual presentation of the results for Type L and for Types C, J, and L is given in Figure 2. We visually present the results of case (b) in simulation 1 for Type C in Figure 3. In the figure, the two upper panels plot the MISEs for each setting. The middle and bottom panels of each figure are the contour plots of the estimators. The red points in the contour plots are the data points used for the dictionary, while the blue ones are those chosen for estimation by the algorithm. The number of blue points is less than $M$ because the algorithm may choose the same data point more than once.

We see the following findings in terms of MISE. In the case of Type C, we observe that cases (a) and (e) can outperform KDE1, KDE2, and RSDE, as can cases (b) and (d) (see Table 2 and the two panels in the second column of Figure 2). This result is important in that our estimator can be superior to the three alternatives with the help of the DPI bandwidth matrix estimator. In the case of Type J, we observe that case (e) can outperform RSDE, and can also outperform KDE2 (see Table 2 and the two panels in the third column of Figure 2). In the case of Type L, we observe that cases (b), (d), and (e) can outperform RSDE, and case (e) can also outperform KDE2 (see Table 2 and the rightmost two panels in Figure 2). The reason why Type C performs better than Types J and L is that the true function of Type C is symmetric and is compatible with a scalar bandwidth matrix. We also observe a general trend that case (e) performs better than cases (a)-(d) in terms of MISE, except in the case of Type L (see Table 1 and the leftmost two panels in Figure 2).

We compare our estimators with RSDE by the degree of data condensation. In Table 3, we show the data condensation ratios of our estimators and RSDE. In the columns of RSDE and (I), we show the ratios of the number of data points actually used for estimating the density function to the total number of data points $N$. In columns (II), we show the ratios of the number of words in $D$ actually used for the estimation to the total number of words $|D|$. We observe four results. First, our method yields lower data condensation ratios in terms of both (I) and (II) than RSDE in all situations. Second, case (a) yields the smallest data condensation ratios (I) and (II) in all situations. Third, the ratio (II) decreases as $m$ increases. Fourth, the ratios (I) and (II) for one value of $\beta$ are greater than those for the other value in each simulation setting; the latter uses fewer data points and words for estimation than the former.

#### 5.2.2 Simulation 2

We present the numerical results of the two proposed estimators in Tables 4 and 5, respectively. The visual presentation of the tables is given in Figure 4 for Type C. We observe two general features from the results in Tables 4 and 5. One is that (c) $\le$ (d) $\le$ (a) in terms of MISE (compare each number in Table 4 with its counterpart in Table 5; for Type C, observe the two line graphs of (a) and (d) in the upper right panel of Figure 4). This indicates that (d) lies between the best and the worst cases of the proposed estimators in terms of MISE. The second is that (c) $\le$ (b) $\le$ (a) in terms of MISE (compare each number in Table 4 with its counterpart in Table 5; see also the line graphs (a), (b), and (c) in each panel of Figure 4, taking Type C as an example). This indicates that MISE can be improved as the percentage of the dictionary data points in the sample size decreases. However, it should be noted that the improvement of MISE caused by a decrease in the proportion of dictionary points occurs only as long as the number of dictionary data points is not too small, because the algorithm is no longer executable in such a situation. Our experiments suggest that MISE can deteriorate when $m$ falls below a certain level. The reason why the two features, (c) $\le$ (d) $\le$ (a) and (c) $\le$ (b) $\le$ (a) in terms of MISE, are not observed in simulation 1 is that each word in $B_1$ in (11) has larger inter-sample variance than that in $B_2$ in (12) (compare each SD of MISE in Tables 1 and 2 with its counterpart in Tables 4 and 5, respectively).

In the same manner as Figure 3, we visually present the results of case (b) in simulation 2 for Types C, J, and L in Figures 5, 6, and 7, respectively. We find that our estimators outperform KDE1, KDE2, and RSDE in terms of MISE in Type C (see the upper two panels of Figure 5). In comparison with simulation 1, simulation 2 yields smaller MISE for Types C and L (compare each number in Tables 1 and 2 with its counterpart in Tables 4 and 5, respectively). Observing the contour plots in the same figures, we find for Types J and L that one of the two estimators captures the shape of the true contour plot rather better than the other (see the middle and bottom panels of Figures 6 and 7). We consider that this difference could be the result of the robustness property of the $\beta$-power divergence function.

We describe the data points that our method chooses for estimation from the dictionary. From the contour plots in the lower two panels of Figure 6, we find that our algorithm generally chooses data points along the ridges of the contour plots. This tendency is also observed for RSDE (see Girolami and He 2003, p.1256).