1 Introduction
Suppose that , , are dimensional i.i.d. sample data generated from the true density finction on . We estimate by the multivariate kernel density estimator (KDE); its general representation is written as
(1) 
where , , is a symmetric and positive definite dimensional bandwidth matrix used for the data , is a nonnegative real valued bounded kernel function, and , , are the weighting parameters assigned for the data . Our aim is to obtain the sparse representation of , which involves estimating by choosing the subset of data points among the given data points , , while simultaneously minimizing the estimation error to the greatest extent possible.
Girolami and He (2003) propose the reduced set density estimator (RSDE), which optimizes the weighting parameters , , in terms of the integrated squared error (ISE) under the constraint , , , assuming the bandwidth matrix is a class of scalar bandwidth matrices with its size of as given. Due to the structure of the optimization problem, its outcome allows for some
’s, realizing the sparse representation of KDE. RSDE is further used for outlier detection and classification in He and Girolami (2004).
In addition to RSDE, Nishida and Naito (2021), associated with Klemelä (2007) and Naito and Eguchi (2013), propose the kernel density estimation using stagewise minimization algorithm (SMA) for the purpose. The SMA algorithm first requires a dictionary that consists of kernel functions with scalar bandwidth matrices, where the original i.i.d. sample is randomly split into two disjoint sets; one is used for the means of the kernels and calculating the bandwidths in the dictionary, and the other is used for evaluating the resulting density estimator via an evaluation criterion. Subsequently, the SMA algorithm proceeds in a stagewise manner, choosing a new kernel from the dictionary at each stage to minimize the evaluation criterion of the convex combination of the new kernel and the estimator obtained in the previous stage. Both of these are transformed using an appropriate function before processing the convex combination. Then, the resulting convex combination backtransformed by the function results in the density estimator at the present stage, realizing the sparse representation of KDE. For the evaluation criterion, divergence is employed, which is similar to the Bregman divergence in Bregman (1967) and Zhang et al. (2009, 2010). This algorithm sorts out the data points used for the density estimator in a stagewise manner improving the evaluation criterion and realizes a sparse representation of KDE when it reaches the final stage. Simulation studies in Nishida and Naito (2021) confirm that the SMA estimator competes with or sometimes performs better than the RSDE estimator in terms of mean integrated squared error (MISE). Simultaneously, SMA outperforms the RSDE in terms of data condensation ratio (DCR), which refers to the ratio of the number of distinct data points used for the estimator to the original sample size . Both methods exhibit a good performance in terms of data condensation, but the number of data points used in the sparse representation cannot be adjusted by users. In addition, the SMA faces a problem. It updates the estimator stage by stage, taking the choices of words up to the present stage as given, which does not necessarily optimizes its evaluation value among all the possible convex combinations of words.
In this study, we propose a new method using genetic algorithm (GA) for KDE in data condensation. A GA (e.g. Goldberg and Holland 1988; Holland 1989; Forrest 1993; Haupt and Haupt 2004; Sivanandam and Deepa 2008) is a stochastic searching algorithm inspired by natural selection process, such as inheritance, selection, mutation, crossover and reproduction. We find many applications of GA to statistical methods, including Duczmal et al. (2007) for spatial scan statistics, Koukouvinos(2007) for exploring circulant supersaturated designs, Vovan et al. (2021) for clustering discrete elements, and Brito et al. (2020) for stratified sampling. However, we have not found its application to KDE so far.
We present an outline of our GA for KDE here. Suppose that we have an original sample of size . From this original sample, we make a subsample of size and repeat this process
times in total with replacement. In our GA, we regard each subsample as vector of data points, where the vector and each of its element are regarded as
and , respectively, in the terminology of GA. Then, we obtain the initial population of chromosomes of population size . In the subsequent generation, we evaluate each inherited chromosome by calculating of the KDE using the cromsome to the true density and sort the chromosomes in the descending sequence of fitness values. The number of upper chromosomes in the sequence are taken over to the next generation by the elite selection rule. At the same time, we make pairs of chromosomes in such a way that two adjoining chromosomes in the sequence are coupled exhaustively and are mutually exclusive. Then, each pair of chromosomes breed two new chromosomes, where each gene in the chromosome is faced with either crossover, mutation, or reproduction with a certain probability. The new chromosomes are sorted again by the updated fitness values and the number of upper of the chromosomes in the sequence are taken over to the next generation. This process is repeated generation by generation until it reaches the final generation. At its completion, we obtain the best fitted subsample and bandwidth matrix for KDE, which have survived all through the selection process. The algorithm is considered to be a solution for combinational optimization problems to extract the subset of data points from the original sample and use the subset to construct the KDE and yield the best fitness value. Generally, a GA is exertive in finding heuristic solutions of a combinational optimization problem.
2 The proposed GA
In this section, we describe our proposed GA for KDE. Let be a fitness function that measures the discrepancy between the true density function and the KDE in (1) which uses the sample and the universal bandwidth matrix .
Step 1: Initial generation :

Define the size of subsample , number of subsamples , final generation number , and the weighting parameters .

Make the number of subsamples of size with replacement from the original sample of size , where is an even number. Each subsample is called chromosome and is denoted as , , where , , is the th data point of the th subsample called gene. The population in the generation 1 is written as .
Step 2: The generations :

Inherite the population from the previous generation .

For each subsample , , calculate the fitness value along with the optimal bandwidth matrix . Then, sort the elements in in the descending order of their fitness values , , and rename the resulting sequence as .

Make the replica .

Breed two new subsamples using the pair of subsamples and , ; each pair of data points and , , faces either of the following with a certain probability.

Mutation : With mutation probability , and are replaced with the two data points randomly chosen from .

Uniform crossover : is swapped for with crossover probability .

Reproduction : and remain unchanged with probability .


For each renewed subsample , for , calculate the fitness value along with the optimal bandwidth matrix . Then, sort the renewed subsamples in the descending order of their renewed fitness values, and rename the resulting sequence as .

The renewed population is taken over to the generation , where is the ratio of the number of subsamples in inherited to the next generation by the elite selection rule.
Step 3: Completion of the algorithm at :

Repeat Step 2 stagewise until reaches . Then, accept the KDE exhibiting the best fitness value written as
along with the resulting subsample and bandwidth matrix .
We present the flowchart in Figure 1.
Remark 1
.
As an empirical rule regarding the choice of and , it is known that setting is desirable to achieve convergence (Goldberg 1989). If is too small, the algorithm tends to yield local minimum solutions; otherwise, it takes a significant amount of time to reach convergence.
Remark 2
.
At the phase of mutation in Step 24(i), data points can be randomly replaced with the ones in the set of the original sample instead of the ones in . In such a situation, diversity of the results is retained.
Remark 3
.
We define
where is an indicator function that designates unity when the statement in the bracket is true. Then, the resulting KDE is written as
with the dataadaptive weighting parameters , , where the algorithm adjusts , , to minimize the fitness value, realizing the sparse representation of KDE.
Fitness function: We naturally extend the idea of the leastsquares crossvalidation method in Rudemo(1982) and Bowman(1984) to our fitness function. Let denote KDE using the subsample and the bandwidth . We consider ISE between and written as
(2)  
Replacing the second term in (2) by its empirical form, we obtain the fitness function
(3)  
where the third term in (2) is excluded because it is not involved in optimization with respect to . We note that the second term in (3) is the mean of over test data while training data are left out. If we employ the training data for the averaging in the second term of (3), the resulting fitness value is degenerated because the distribution of the training data through our GA is not necessarily identical to that of the original data to be presented in simulation sections. We employ (3) as the fitness function to evaluate the estimator and use it to find the optimal bandwidth matrix.
3 Simulation studies
We conduct Simulations 13 to validate performance of our proposed GA. Simulations 1 and 2 are MonteCarlo simulations dealing with the bivariate and trivariate settings respectively. Simulation 3 is a real data application of our method. We assume the bandwidth matrix in the simulations as a class of scalar matrices and the weighting parameters are set to be . The competitors to our GA are the Direct Plugin method in Duong and Hazelton (2003) in the setting of full bandwidth matrix (henceforce, DPI) and the RSDE. In calculating DPI, we employ Hpi function in ’ks’ library in R. For kernel, we employ Gaussian throughout the study.
In simulations 1 and 2, we generate a dataset of size from and obtain different kernel density estimators by applying our GA times for the dataset. Then, we define the performance measure to be the average of the different ISE’s computed by each estimator; we denote the measure to be ISE or ISE, the latter of which represents the ISE at the generation . Introducing ISE for performance measure is inevitable because GA is a stochastic search algorithm. Each simulation is categorized by the different sizes of because we expect the performance of our GA to be concerned with the size of . In evaluating the performance of DPI and RSDE, we generate the number of datasets of size and compute MISE by averaging ISE’s computed by each dataset, where one of the datasets is identical to the one used for computing the ISE. The simulations are developed in terms of the degree of data condensation as well because we expect the influential data points to the estimation error to survive the selection process of our GA.
3.1 Simulation 1 (bivariate)
In Simulation 1, we employ the bivariate simulation settings Type C and L in Wand and Jones (1993) given below, where the notation represents the bivariate normal density with means ,
, , and correlation coefficient , whose contourplots are drawn in Figure 2.Type C. Skewed:
Type L. Quadrimodal:
We employ the following GA parameter settings and consider three sample sizes, , and . The numerical results in terms of estimation error for Types C and L are presented in Tables 1 and 3 respectively. The numbers DPI and RSDE are the values of ISE calculated by the DPI and the RSDE respectively using the identical sample in calculating ISE by our GA. The minimum values of ISE at each generation over the sizes of
are underlined in the Tables. All the parenthetic numbers in the tables represent standard deviations (S.D.) throughout the study.
Summary of the results : Type C
Figure 3 summarizes the results in Table 1 for the case of , where we pick up the case as one example because we observe the common tendencies regardless of . The panel (a) in Figure 3 visually summarizes the results in Table 1 with respect to the size of , representing the plot of the value of ISE at for each size of , along with the values of MISE for the DPI and the RSDE estimators. We observe that ISE takes the minimum value in the neighborhood of . It is because the two effects are balanced when is close to 25: one effect is that ISE is improved as the number of training data points increases and the other is that ISE degenerates as the number of test data points decreases. We also note that our GA can outperform the DPI estimator even in the case of when is greater than the vicinity of . We consider the five data points chosen by our GA are essential to constructing the density estimator.
The panel (b) in Figure 3 summarizes the result in Table 1 with respect to when . We observe that our proposed GA of can outperform the DPI estimator using the original sample of the size in terms of estimation error when is greater than the vicinity of 5. This feature is observed regardless of , excepting the case of .
The panels (c) and (d) in Figure 3 are respectively the contourplot and the perspectiveplot of our estimator, both of which comes from one shot of 10 implementations of our GA to a sample, where the red points in the contourplot are the original sample of the size , while the blue ones are those chosen by our GA in the estimation. We observe that our GA captures the shape of the true density. We also observe that our GA tends to choose the data points close to the peak of the true density, comparing the distribution of the blue points with that of the red ones in the panel (c).
Looking at the values of the cases , and in Table 1, we notice that ISE is minimized in the vicinity of over . We consider it to be so because the data points without good influence can be included in the subsample as progresses when the variety of the training data points is rich.
The numerical results in terms of data condensation for Type C are presented in Table 2. The numbers in column (I) represent the number of distinct data points used for estimating the density on the average over implementations of our GA to a sample. The numbers in column (II) represent the maximum multiplicity of the data points on the average, where the multiplicity is defined to be the number of times the single data point is chosen by the GA. The DCR by our GA is defined to be (I) divided by on the average over implementations of our GA to a sample. In contrast, the DCR by RSDE is the average of DCR over different datasets, in which the identical dataset used in calculating the DCR by our GA is included.
We observe that our estimator can outperform RSDE in terms of the DCR and yields the smaller DCR as the size becomes smaller for each sample size . We also observe that the value of (II), which means the maximum weight of data points used for the estimation, increases as the size of becomes small. It means that our GA chooses the data points with great influence on estimation, many times as becomes small. If we compare the DCR over different sample sizes for each size of , we observe the smaller DCR for the larger sample size .
We are also interested in the performance of the estimator by our GA over different original datasets. For reference, we calculate the MISE of our proposed estimator using the identical original datasets for calculating the MISE’s of the DPI and the RSDE in Table 1. We calculate MISE in such a way that we implement our GA once for each of 10 datasets until and calculate the average of all the resulting ISEs. In the case of , the best performance case in terms of ISE of the simulation, MISE (S.D. ) comes out to be (44), outperforming its competitors. The corresponding DCR (S.D.) is 0.0154 (0.0034). The superiority of our GA over different original samples to its competitors is widely observed in the rest of the cases as well.
1  25  50  75  100  DPI  RSDE  DPI  RSDE  

—  —  —  —  —  711 (198)  1503 (616)  578  972  
1855 (253)  1485 (190)  1485 (190)  1485 (190)  1485 (190)  

1014 (190)  296 (84)  295 (83)  295 (83)  295 (83)  
736 (295)  211 (36)  197 (31)  186 (27)  198 (33)  
568 (121)  223 (40)  215 (57)  210 (59)  215 (37)  
587 (136)  266 (104)  270 (62)  289 (61)  277 (45)  
448 (114)  268 (98)  291 (64)  294 (98)  361 (95)  
—  —  —  —  —  435 (83)  1231 (347)  413  842  
1523 (375)  1173 (143)  1173 (143)  1173 (143)  1173 (143)  

894 (362)  347 (59)  327 (63)  327 (63)  327 (63)  
742 (249)  309 (38)  309 (67)  300 (31)  293 (31)  
606 (173)  316 (58)  312 (51)  325 (46)  312 (31)  
427 (209)  322 (37)  359 (61)  327 (39)  352 (43)  
473 (147)  319 (48)  360 (47)  372 (68)  385 (58)  
—  —  —  —  —  253 (52)  643 (190)  220  842  
1576 (286)  1276 (64)  1275 (64)  1275 (64)  1275 (64)  

860 (304)  237 (63)  224 (56)  224 (56)  224 (56)  
630 (101)  112 (35)  99 (30)  78 (25)  77 (24)  
466 (140)  109 (39)  79 (24)  82 (29)  86 (23)  
373 (123)  114 (28)  109 (16)  115 (30)  112 (22)  
368 (62)  133 (32)  137 (39)  138 (27)  148 (42) 
(I)  (II)  (II)  DCR.GA  DCR.RSDE  

—  —  —  —  .2650 (.0291)  
11.80 (1.75)  5.30 (0.94)  0.21  .0590 (.0088)  —  
19.40 (1.50)  7.40 (1.26)  0.14  .0970 (.0075)  —  
32.60 (2.98)  12.20 (4.02)  0.12  .1630 (.0149)  —  
42.90 (4.79)  13.80 (1.81)  0.09  .2145 (.0240)  —  
—  —  —  —  .2047 (.0158)  
14.40 (2.27)  5.30 (2.11)  0.21  .0360 (.0057)  —  
26.80 (4.61)  6.20 (2.34)  0.12  .0670 (.0115)  —  
44.10 (4.58)  7.80 (2.14)  0.07  .1103 (.0115)  —  
56.30 (4.85)  10.40 (2.87)  0.06  .1408 (.0121)  —  
—  —  —  —  .1524 (.0155)  
15.80 (1.54)  3.60 (0.69)  0.14  .0158 (.0015)  —  
29.60 (5.03)  5.40 (1.83)  0.10  .0296 (.0050)  —  
56.40 (5.27)  5.70 (1.05)  0.05  .0564 (.0053)  —  
78.30 (8.94)  7.60 (1.17)  0.05  .0783 (.0089)  — 
Summary of the results : Type L
Figure 4 summarizes the results in Table 3 for the case of as an example, the counterpart of Figure 3. From the panel (a) in Figure 4, we observe that ISE takes the minimum value from to . We assume that ISE would increase when exceeds 300 and approaches 1000. When is greater than the vicinity of , our GA can outperform DPI estimator with its sample size in terms of estimation error. The panel (b) in Figure 4 plots the value of ISE at every generation in the case of . We observe that our proposed GA method of can outperform DPI estimator using the original sample of the size in terms of estimation error when is greater than the vicinity of 20. The contourplot in (c) of Figure 4 captures the shape of true density function and exhibits similar tendencies to the results in the case of Type C. From the results in Table 3, we observe our GA can outperform DPI estimator in Type L only when , , and . It seems that Type L is more difficult in making estimators using our GA than Type C because Type L requires a larger size of and than Type C to achieve a smaller estimation error than DPI.
Table 4 is the results in terms of the DCR for Type L, being the counterpart of Table 2. We confirm the similar tendencies to Type C and our GA can yield the smaller DCR than RSDE while simultaneously yielding smaller estimation errors.
We calculate MISE of our estimator in the same manner as Type C. In the case of , MISE (S.D. ) comes out to be (45), outperforming its competitors. The corresponding DCR (S.D.) is 0.0819 (0.0051).
1  25  50  75  100  DPI  RSDE  DPI  RSDE  

—  —  —  —  —  734 (141)  1299 (524)  613  1122  
2254 (399)  1628 (315)  1628 (315)  1628 (315)  1628 (315)  

1672 (401)  1372 (101)  1378 (102)  1378 (102)  1378 (102)  
1151 (240)  974 (122)  917 (129)  887 (109)  922 (81)  
1179 (263)  905 (114)  880 (119)  881 (91)  846 (57)  
957 (243)  879 (130)  833 (86)  843 (81)  871 (69)  
845 (180)  853 (59)  867 (50)  858 (71)  841 (81)  
—  —  —  —  —  469 (73)  947 (284)  406  834  
1801 (250)  1443 (71)  1443 (71)  1443 (71)  1443 (71)  
1493 (297)  1470 (368)  1456 (357)  1456 (357)  1456 (357)  
1074 (195)  737 (101)  674 (117)  639 (123)  624 (101)  
798 (206)  636 (79)  584 (75)  598 (79)  566 (54)  
721 (149)  634 (79)  584 (77)  562 (65)  555 (38)  
615 (130)  578 (61)  553 (60)  562 (53)  527 (40)  
—  —  —  —  —  270 (40)  508 (103)  223  455  
1980 (342)  1486 (175)  1486 (175)  1486 (175)  1486 (175)  
1483 (213)  1065 (88)  1052 (79)  1052 (79)  1052 (79)  
845 (115)  433 (80)  328 (63)  325 (46)  307 (37)  
721 (80)  294 (45)  247 (31)  224 (36)  238 (43)  
531 (84)  263 (56)  226 (31)  222 (32)  206 (26)  
470 (81)  237 (38)  196 (46)  187 (20)  179 (13) 
(I)  (II)  (II)  DCR.GA  DCR.RSDE  

—  —  —  —  .2695 (.0326)  
14.60 (1.07)  3.60 (0.84)  0.14  .0730 (.0053)  —  
25.20 (2.82)  5.20 (0.91)  0.10  .1260 (.0141)  —  
36.70 (6.09)  8.80 (2.14)  0.08  .1835 (.0304)  —  
49.30 (3.83)  11.90 (3.44)  0.07  .2465 (.0191)  —  
—  —  —  —  .2125 (.0134)  
17.00 (1.94)  3.50 (1.08)  0.14  .0425 (.0048)  —  
31.50 (2.36)  4.50 (0.97)  0.09  .0787 (.0059)  —  
51.90 (5.54)  6.30 (1.63)  0.06  .1297 (.0138)  —  
68.70 (5.16)  8.10 (1.72)  0.05  .1717 (.0129)  —  
—  —  —  —  .1474 (.0103)  
17.50 (1.95)  2.90 (0.56)  0.11  .0175 (.0019)  —  
33.60 (2.41)  4.10 (1.28)  0.08  .0336 (.0024)  —  
59.30 (5.39)  4.80 (0.78)  0.04  .0593 (.0053)  —  
79.40 (7.64)  6.50 (1.43)  0.04  .0794 (.0076)  — 
3.2 Simulation 2 (trivariate)
In Simulation 2, we employ the following trivariate simulation setting for the true density, where the notation represents the trivariate normal density with means , , and , variances , , and , and correlation coefficients , , and .
Type trivariate. Trivariate skewed:
We implement the simulation in the same manner as Simulation 1 employing the same GA parameter settings . The numerical results are presented in Table 5, which are summarized in Figure 5 for the case of . From the panel (a) in Figure 5, we observe that ISE takes the minimum value in the vicinity of . When is greater than the vicinity of , our GA can outperform the DPI estimator with its sample size in terms of estimation error. The panel (b) in Figure 5 plots the value of ISE at every generation in the case of . We observe that our proposed GA method of can outperform the DPI estimator using the sample of the size in terms of estimation error when is greater than the vicinity of 8.
Comparing the results of ISE in Tables 1 and 5, the speed of convergence with respect to reduces as the dimension increases, except for the cases of and . Looking at the numeric results in the cases of and in Table 5, we also observe that ISE is minimized at while in the case of Type C bivariate, it is minimized at for the same size of . We conjecture that a smaller number of training data points is sufficient to minimize estimation error in the larger dimension.
The panel (a) in Figure 6 is the 3D contourplot of the result at the five density levels , , , and , which we draw using the bandwidth matrix , , calculated by our GA and slice at . The panels (b) (c) and (d) in the same Figure are the representations of the trivariate results by the 2D contourplots of vs , vs , and vs respectively, which we draw using the same bandwidth as that of (a). From the shape of the 2D contourplots, the three 2D marginal densities of Type C trivariate are the same shape by symmetry, and the estimation by our GA appears to be working to some degree.
Table 6 presents the results in terms of the DCR for Type C trivariate, being the counterpart of Tables 2 and 4. We confirm the similar tendencies to Type C and L in terms of DCR’s and our GA can yield the smaller DCR while simultaneously yielding smaller estimation error than RSDE. Observing the results in Table 2 and 6, Type C trivariate yields the larger DCR compared to Type C when and for each size of while both are comparable in the case of .
We calculate MISE of our estimator in the same manner as simulation 1. In the case of , MISE (S.D. ) comes out to be (71), outperforming its competitors. The corresponding DCR (S.D.) is 0.0157 (0.0020).
1  25  50  75  100  DPI  RSDE  DPI  RSDE  

—  —  —  —  —  1053 (252)  1408 (812)  1039  2031  
1341 (263)  795 (147)  795(147)  795(147)  795 (147)  

975 (382)  427 (63)  421 (53)  421 (53)  421 (53)  
815 (186)  440 (158)  482 (159)  561 (150)  567 (145)  
604 (159)  524 (87)  726 (151)  846 (111)  842 (157)  
678 (199)  738 (179)  1137 (243)  1312 (204)  1391 (196)  
741 (231)  895 (161)  1414 (139)  1446 (109)  1632 (114)  
—  —  —  —  —  665 (58)  1041 (248)  704  1267  
1078 (331)  715 (93)  715 (93)  715 (93)  715 (93)  

1067 (407)  369 (74)  362 (70)  362 (70)  362 (70)  
753 (134)  409 (34)  419 (55)  438 (66)  464 (61)  
648 (101)  411 (83)  470 (78)  517 (100)  511 (77)  
682 (151)  493 (62)  555 (91)  554 (67)  638 (37)  
520 (79)  508 (58)  600 (66)  630 (59)  681 (68)  
—  —  —  —  —  369 (55)  794 (177)  361  950  
1076 (229)  845 (124)  845 (124)  845 (124)  845 (124)  

846 (333)  251 (35)  259 (41)  259 (41)  259 (41)  
648 (151)  184 (34)  174 (33)  168 (25)  170 (23)  
535 (119)  177 (38)  177 (48)  167 (43)  176 (32)  
423 (104)  193 (60)  209 (36)  217 (27)  218 (24)  
382 (94)  201 (46)  220 (31)  243 (31)  254 (36) 
(I)  (II)  (II)  DCR.GA  DCR.RSDE  

—  —  —  —  .3775 (.0487)  
14.40 (2.41)  4.70 (0.82)  0.18  .0720 (.0120)  —  
26.70 (3.05)  6.00 (1.88)  0.12  .1335 (.0152)  —  
43.40 (3.59)  8.00 (2.10)  0.08  .2170 (.0179)  —  
55.50 (4.88)  11.10 (1.72)  0.07  .2775 (.0244)  —  
—  —  —  —  .3135 (.0298)  
14.50 (1.90)  4.50 (1.17)  0.18  .0362 (.0047)  —  
27.80 (3.42)  6.00 (1.24)  0.12  .0695 (.0085)  —  
49.00 (4.32)  8.00 (2.35)  0.08  .1225 (.0108)  —  
64.80 (6.62)  11.50 (2.67)  0.07  .1620 (.0165)  —  
—  —  —  —  .1857 (.0111)  
15.70 (1.82)  5.30 (1.63)  0.21  .0157 (.0018)  —  
30.90 (2.99)  5.60 (2.50)  0.11  .0309 (.0029)  —  
53.60 (5.77)  7.10 (1.72)  0.07  .0536 (.0057)  —  
73.80 (5.49)  9.00 (2.21)  0.06  .0738 (.0054)  — 
3.3 Simulation 3 (Real data application)
We show a real data application of trivariate density estimation for our GA. We use abalone data set
, originally from Nash et al. (1994), available in the UCI Machine Learning Repository. The dataset consists of the physical measurements of abalones collected in Tasmania, with each abalone being measured on eight attributes: Sex (male, female, infants), Length (mm), Diameter (mm), Height (mm), Whole weight (grams), Shucked weight (grams), Viscera weight (grams), Shell weight (grams), and Rings (integer). We construct the dataset consist of Diameter, Viscera weight and Shucked weight out of the eight attributes. The sample size
is for male abalones. In the estimation, we set . We implement our GA once for the trivariate abalone dataset and obtain , at the completion of our GA. In terms of data condensation, we obtain the following results; the number of different data points, 44; the maximum multiplicity, 6; and the DCR, 0.0288. For reference, RSDE brings DCR=0.1865.The panel (a) in Figure 7 is the 3D contourplot of the result at the five density levels , , , and . The panels (b) (c) and (d) in the same Figure are the representations of the trivariate results by the 2D contourplots of Diameter vs Viscera weight, Diameter vs Shucked weight, and Viscera weight vs Shucked weight respectively, which we draw by using the same bandwidth matrix as that of (a). The red points in the 2D contourplots designate the original data points, while the blue ones are chosen by our GA for the estimation. From the shape of the 2D contourplots, the estimation by our GA appears to be working. From the 2D contourplots (b), (c) and (d) tell us that our GA generally chooses data points along with the mountain ridges of the contourplots. This tendency is also observed in RSDE (Girolami and He 2003, p.1256) and SMA (Nishida and Naito 2021).
The panels (a) (b) (c) and (d) in Figure 8 are the plot of the (minus) fitness values in (3) at each generation , the plot of the bandwidths at each generation , the histgram of the multiplicity of each data point, and the frequency plots of the data points selected by our GA respectively. In the panel (d), the data points are indexed in the ascending order of distance from the origin to illustrate the spatial distribution of weights. The panel (a) tells us our GA reduces the (minus) fitness value as progresses. The panel (b) tells us that the bandwidth is no longer updated when is greater than 300. The panel (c) tells us that the only data point (Diameter, Viscera weight, Shucked weight)
is chosen 6 times. The panel (d) tells us that greater weights are placed on those in the vicinity of the data point indexed as 25. It also tells us that the data points with unit weight are uniformly distributed over the indices.
4 Discussion
In this study, we propose a data condensation method for KDE by GA. In the initial generation of our GA, we first construct multiple subsamples of a given size with replacement from the original sample, where each subsample and each of the constituting data point is called a chromosome and gene, respectively. In the subsequent generation, we evaluate the chromosomes in terms of fitness, and some chromosomes are inherited to the next generation by the elite selection rule. In line with elite selection, a pair of chromosomes, paired in order of the best fitness value, breed two new chromosomes by crossover, mutation, and reproduction and some of the new chromosomes, whose recalculate fitness values predominate those of the rest, are also inherited to the next generation in such a way that the total number of chromosomes inherited to the next remains unchanged over generation. This process is repeated generation by generation until the terminating condition is fulfilled and we finally obtain the KDE using the best subsample along with the best smoothing parameter.
We validate the performance of our GA by simulations and confirm that it can yield the KDE better than DPI and RSDE in terms of estimation error and DCR in many situations. In addition to the simulation studies, we also conduct the sensitivity analysis of the tuning parameters , and to the performances of the resulting density estimator although we omit presenting the detailed results in this study for its brevity. We confirm that the sizes of and are influential to the results; if improperly selected, the speed of convergence becomes slow. Hence, we follow the practice adopted in the literature in this study (see Remark 1). We also confirm that the impact of the size of is nonsignificant unless is set to be miniscule. As for the size of , if the size of is miniscule, the diversity of the chromosomes is reduced. We confirm that ranging from to is proper for the original sample sizes given in our simulation studies.
Some variants of our GA are possible, such as employing point crossover in the stage of crossover or performing roulette selection, tournament selection, and ranking selection in the stage of selection (e.g., Haupt and Haupt 2004; Sivanandam and Deepa 2008). Our supplemental simulation studies show that these variants are not better than our proposed GA method in terms of the speed of convergence; hence, we do not present the detailed results.
This study highlights three important issues associated with our GA. First, it is noteworthy that our GA can be superior to DPI in terms of estimation error even though our method employs a scalar bandwidth matrix, that is, the simpler matrix form. Second, our simulation results tell us how we determine the size of , which is equivalent to how we choose a combination of the data points of the size from the original sample, that plays a role of smoothing parameter as well as the wellknown smoothing parameters such as bandwidth and weighting parameter. Our study shows that the optimal is determined by the balance of effects between the training and test data points in their numbers. Third, our GA is similar to RSDE in that both the methods ultimately aim to find the optimal weighting parameters assigned to each data point (see Remark 3). In the case of RSDE, it is required to compute the initial bandwidth matrix , which is calculated in this study by the least squared crossvalidation method under the condition following Girolami and He (2003). This initial bandwidth matrix does not necessarily generate the best performance of estimation error for RSDE among the possible combinations of because the optimization of in the initial stage does not anticipate optimizing . Better initial bandwidth matrices would probably exist, although they cannot be found. In contrast, our GA is different in that our method can update the optimal bandwidth at every stage. It also looks like our GA performs significantly better than RSDE in terms of DCR.
For further studies, we will work toward finding the optimal size of subsamples and apply our GA to kernel regression estimation.
Acknowledgements
The author gratefully acknowledges the financial support from KAKENHI 19K11851.
References
 [1] [] Bowman, A. (1984). An alternative method of crossvalidation for the smoothing of density estimates. Biometrika, 71(2), pp.353360.
 [2] [] Bregman, L.M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming, USSR computational mathematics and mathematical physics, 7(3), pp.200217.
 [3] [] Brito, J., Fadel, A. and Semaan, G. (2020). A genetic algorithm applied to optimal allocation in stratified sampling, Communications in statistics  simulation and computation, forthcoming. DOI:10.1080/03610918.2020.1722832
 [4] [] Duczmal, L., Cancado, A.L.F., Takahashi, R.H.C., and Bessegato, L.F. (2007). A genetic algorithm for irregularly shaped spatial scan statistics, Computational statistics and data analysis, 52(1), pp.4352.
 [5] [] Duong, T., and Hazelton, M.L. (2003). Plugin bandwidth matrices for bivariate kernel density estimation, Journal of nonparametric statistics, 15(1), pp.1730.
 [6] [] Forrest, S. (1993). Genetic Algorithms: principles of natural selection applied to computation, Science, 261(5123), pp.872878.
 [7] [] Girolami, M., and He, C. (2003). Probability density estimation from optimally condensed data samples, IEEE transactions on pattern analysis and machine intelligence, 25(10), pp.12531264.
 [8] [] Goldberg, D.E., and Holland, J.H. (1988). Genetic algorithms and machine learning, Machine learning, 3, pp.9599.
 [9] [] Goldberg, D.E. (1989). Genetic algorithms in search, optimization, and machine learning. AddisonWesley.
 [10] [] Haupt, R.L., and Haupt, S.E.(2004). Practical genetic algorithms, Second edition. Wiley.

[11]
[] He, C., and Girolami, M.(2004). Novelty detection employing an L2 optimal nonparametric density estimator,
Pattern recognition letters, 25(12), pp.13891397.  [12] [] Holland, J.H. (1992). Genetic algorithms, Scientific American, 267(1), pp.6672.
 [13] [] Klemelä, J. (2007). Density estimation with stagewise optimization of the empirical risk, Machine learning, 67(3), pp.169195.
 [14] [] Koukouvinos, C., Mylona, K., and Simos, D.E. (2007). Exploring kcirculant supersaturated designs via genetic algorithms, Computational statistics and data analysis, 51(6), pp.29582968.
 [15] [] Naito, K., and Eguchi, S. (2013). Density estimation with minimization of divergence, Machine learning, 90(1), pp.2957.
 [16] [] Nash, W.J., Sellers, T.L., Talbot, S.R., Cawthorn, A.J., and Ford, W.B. (1994). The population biology of abalone (haliotis species) in Tasmania. I. blacklip abalone (H. rubra) from the north coast and islands of bass Strait, Sea fisheries division, technical report No. 48 (ISSN 10343288).
 [17] [] Nishida, K., and Naito, K. (2021). Kernel density estimation by stagewise algorithm with a simple dictionary, arXiv, 2107.13430, Cornell university.
 [18] [] Rudemo, M. (1982) Empirical choice of histograms and kernel density estimators. Scandinavian journal of statistics, 9, pp.6578.
 [19] [] Sivanandam, S.N., and Deepa, S.N. (2008). Introduction to genetic algorithms, Springer.
 [20] [] Vovan, T., Phamtoan, D., and Tranthituy, D. (2021). Automatic genetic algorithm in clustering for discrete elements, Communications in statistics  simulation and computation, 50(6), pp.16791694.
 [21] [] Wand, M.P., and Jones, M.C. (1993). Comparison of smoothing parametrizations in bivariate kernel density estimation, Journal of the American statistical association, 88(422), pp.520528.
 [22] [] Zhang, C., Jiang, Y., and Chai, Y. (2010). Penalized bregman divergence for largedimensional regression and classification, Biometrika, 97(3), pp.551566.
 [23] [] Zhang, C., Jiang, Y., and Shang, Z. (2009). New aspects of bregman divergence in regression and classification with parametric and nonparametric estimation, Canadian journal of statistics, 37(1), pp.119139.