1 Introduction
Density estimation is widely used in a variety of fields in order to study the data and extract informations on variables whose distribution is unknown. Due to its simplicity of use and interpretation, kernel density estimation is one of the most commonly used density estimation procedure. Of course we do not pretend that it is ”the” method to be used in any case but that being said, if one wants to use it in a proper way, one has to take into account that its performance is conditioned by the choice of an adapted bandwidth. From a theoretical perspective, once some loss function is given an ideal bandwidth should minimize the loss (or the expectation of the loss) between the kernel density estimator and the unknown density function. Since these ”oracle” choices do not make sense in practice, statistical bandwidth selection methods consist of mimiking the oracle through the minimization of some criteria that depend only on the data. Because it is easy to compute and to analyze, the
loss has been extensively studied in the literature although it would also make sense to consider the KulbackLeibler loss, the Hellinger loss or the loss (which are somehow more intrinsic losses). For the same reasons as those mentioned above, we shall deal with the loss in this paper and all the comparisons that we shall make between the method that we propose and other methods will be performed relatively to this loss. Focusing on the loss, two classes of bandwidth selection have been well studied and are commonly used: cross validation and plugin. They correspond to different ways of estimating the ISE (integrated squared error) which is just the square of the loss between the estimator and the true density or the MISE (mean integrated squared error), which is just the expectation of the preceding quantity. The leastsquare crossvalidation (LSCV) [Rud82, Bow84] tends to minimize the ISE by replacing the occurence of the underlying density by the leaveoneout estimator. However LSCV suffers from the dispersion of the ISE even for large samples and tends to overfit the underlying density as the sample size increases. The plugin approaches are based on the asymptotic expansion of the MISE. Since the asymptotic expansion of the MISE involves a bias term that depends on the underlying density itself one can estimate this term by plugging a pilot kernel estimator of the true density. Thus this plugin approach is a bit involved since it depends on the choice of a pilot kernel and also on the choice of the pilot bandwidth. The socalled ”rule of thumb” method [Sil86] is a popular ready to be used variant of the plugin approach in which the unknown term in the MISE is estimated as if the underlying density were Gaussian. A survey of the different approaches and a numerical comparison can be found in [HSS13].These methods have been first proposed and studied for univariate data and then extended to the multivariate case. The LSCV estimator for instance has been adapted in [Sto84] to the multivariate case. A multivariate version of the smooth crossvalidation is presented in [CD11] and [DH05]. The rule of thumb is studied in [Wan92] and the multivariate extension of the plugin is developed in [WJ94b, CD10]. Generally speaking, these methods have some well known drawbacks: crossvalidation tends to overfit the density for large sample while plugin approaches depend on prior informations on the underlying density that are requested to estimate asymptotically the bias part of the MISE and which can turn to be inaccurate especially when the sample size is small.
The Penalised Comparison to Overfitting (PCO) is a selection method that has been recently developped in [LMR17]. This approach is based on the penalization of the
loss between an estimator and the overfitting one. It does not belong to the family of crossvalidation methods nor to the class of plugin methods, but lies somewhere between these two classes. Indeed the main idea is still to mimic the oracle choice of the bandwidth but the required bias versus variance tradeoff is achieved by estimating the variance with the help of a penalty term (as in a plugin method) while the bias is estimated implicitly from the penalization procedure itself as in a crossvalidation method. More precisely, the criterion to minimize is obtained from the biasvariance decomposition of the Mean Integrated Squared Error (MISE) leading to a penalty term which depends on some ultimate tuning parameter. It is proved in
[LMR17] that asymptotically, the theoretical optimal value for this tuning parameter is 1 which makes it a fully ready to be used bandwidth selection method.In this paper, we aim at completing the theoretical work performed in [LMR17] by some simulation studies that help to understand wether PCO behaves as well as expected from theory. In particular we have in mind to check wether choosing the penalty constant as 1 as predicted by theory is still a good idea for small sample sizes. We also compare the numerical performances of PCO to some of the classical crossvalidation and plugin approaches. Of course, a large number of bandwidth selection methods have been proposed for kernel density estimation and some of them have in addition many variants. We do not pretend to provide a complete survey of kernel methodologies for density estimation but our purpose is to compare PCO to what we think are the most extensively used methods. As seen later on, PCO offers several advantages which should be welcome for practitioners:

It can be used for moderately high dimensional data

To a large extent, it is freetuning

Its computational cost is quite reasonable
Furthermore, PCO satisfies oracle inequalities, providing theoretical guarantees of our methodology. So, we naturally lead a numerical study to compare PCO with some popular methods sharing similar advantages. More precisely, we focus on the Rule of thumb, the LeastSquare CrossValidation, the Biased CrossValidation, the Smoothed CrossValidation and the Sheather and Jones Plugin approach. These approaches are the most extensively used ones and to the best of our knowledge, the only ones for which a complete and general Rpackage has been developed to deal with multivariate densities , with . This package is ks. To the best of our knowledge, optimal theoretical properties of all these methods have not been proved yet.
Concretely, the performance of each of the method that we analyze is measured in terms of the MISE of the corresponding selected kernel estimator (more precisely we use the MonteCarlo evaluation of the MISE rather than the MISE per se). We present the results obtained for several ”test” laws. We borrowed most of these laws from [MW92] for univariate data and [Cha09] for bivariate data (and we use some natural extensions of them for multivariate data, up to dimension 4). In [LMR17] the bandwidth matrix is supposed to be diagonal. Since in practice it is important to take into account correlations between variables, we extend their oracle inequality results to symmetric definitepositive bandwidth and we also numerically compare the sodesigned PCO bandwidth selection to the classical approaches for nondiagonal bandwidth.
The section 2 is devoted to the presentation of all the methods used for the numerical comparison. The numerical results for univariate data are detailed in 3.3 and in section 3.4 for multivariate data. The extension of the oracle inequality of PCO for nondiagonal bandwidth is given in section 2.2.1 and the associated proofs are in section A.
Notations:
The bold font denotes vectors. For any matrix
, we denote the transpose of . denotes the trace of the matrix .2 Bandwidth selection
Due to their simplicity and their smoothing properties, kernel rules are among the most extensively used methodologies to estimate an unknown density, denoted along this paper, where . For this purpose, we consider an sample with . The kernel density estimator, , is given, for all , by
(1) 
where is the kernel function, the matrix is the kernel bandwidth belonging to a fixed grid and Of course, the choice of the bandwidth is essential from both theoretical and practical points of view. In the sequel, we assume that verifies following conditions, satisfied by usual kernels:
(2) 
where denotes the norm on . Most of selection rules described subsequently are based on a criterion to be minimized. We restrict our attention to criteria even if other approaches could be investigated. For this purpose, we introduce the Integrated Square Error (ISE) of the estimator defined by
(3) 
and the mean of :
(4) 
2.1 Univariate case
We first deal with the case and we denote the sample of density . The general case is investigated in Section 2.2. The bandwidth parameter lies in and is denoted , instead of . In our perspective, it is natural to use a bandwidth which minimizes or . However, such functions strongly depend on . We can relax this dependence by using an asymptotic expansion of the MISE:
(5) 
when exists and is squareintegrable, with . We refer the reader to [WJ94a] who specified mild conditions under which is close to when . The main advantage of the AMISE criterion lies in the closed form of the bandwidth that minimizes it:
(6) 
Note however, that still depends on through . The Rule of Thumb developed in [Par62] and popularized by [Sil86] (and presented subsequently) circumvents this problem. Crossvalidation approaches based on a direct estimation of constitute an alternative to bandwidth selection derived from the AMISE criterion. Both approaches can of course be combined. Before describing them precisely, we first present the PCO methodology for the univariate case.
2.1.1 Penalized Comparison to Overfitting (PCO)
Penalized Comparison to Overfitting (PCO) has been proposed by Lacour, Massart and Rivoirard [LMR17]
. We recall main heuristic arguments of this method for the sake of completeness. We start from the classical biasvariance decomposition
where for any , , with the convolution product. It is natural to consider a criterion to be minimized of the form
where is an estimator of the bias and an estimator of the variance . Minimizing such a criterion is hopefully equivalent to minimizing the MISE. Using that is (tightly) bounded by , we naturally set , with some tuning parameter. The difficulty lies in estimating the bias. Here we assume that , the minimum of the bandwidths grid, is very small. In this case, is a good approximation of , so that is close to . This is tempting to estimate this term by but doing so, we introduce a bias. Indeed, since
we have the decomposition
(7) 
But the centered variable can be written
So, the second term in the right hand side of (7) is of order Hence,
and then
These heuristic arguments lead to the following criterion to be minimized:
(8) 
Thus, our method consists in comparing every estimator of our collection to the overfitting one, namely before adding the penalty term
(9) 
The selected bandwidth is then
(10) 
In [LMR17], it is shown that the risk blows up when . So, the optimal value for lies in . Theorem 2 of [LMR17] (generalized in Theorem 2.1 of Section 2.2.1) suggests that the optimal tuning parameter is . It is also suggested by previous heuristic arguments (see the upper bound of ). In Section 3.3, we first conduct some numerical experiments and establish that PCO is indeed optimal for . We then fix for all comparisons so PCO becomes a freetuning methodology.
Connections between PCO and the approach proposed by Goldenshluger and Lepski are quite strong. Introduced in [GL08], the GoldenshlugerLepski’s methodology is a variation of the Lepski’s procedure still based on pairbypair comparisons between estimators. More precisely, Goldenshluger and Lepski suggest to use the selection rule
with
where denotes the positive part , and and are penalties to be suitably chosen (Goldenshluger and Lepski essentially consider or in [GL08, GL09, GL11, GL13]). The authors establish the minimax optimality of their method when and are large enough. However, observe that if , then, under mild assumptions,
so that our method turns out to be exactly some degenerate case of the GoldenshlugerLespki’s method. Two difficulties arise for the use of the GoldenshlugerLespki’s method: Functions and depend on some parameters which are very hard to tune. Based on 2 optimization steps, its computational cost is very large. Furthermore, the larger the dimension, the more accurate these problems are. Note that the classical Lepski method shares same issues.
2.1.2 Silverman’s Rule of thumb (RoT and RoT0)
The Rule of Thumb has been developed in [Par62] and popularized by [Sil86]. We assume that exists and is such that . The simplest way to choose is to use a standard family of distributions to minimize .
For a Gaussian kernel and
the probability density function of the normal distribution, an approximation of
can be plugged in (6) leading to a bandwidth of the form whereis an estimation of the standard deviation of the data. However this bandwidth leads to an oversmoothed estimator of the density for multimodal distributions. Thus it is better to use the following estimator, which works well with unimodal densities and not too badly for moderate bimodal ones:
(11) 
where is an estimation of the interquartile range of the data. Another variant of this approximation ([Sil86] p.4547) is:
(12) 
These two variants of the Rule of Thumb methodology are respectively denoted RoT and RoT0.
2.1.3 LeastSquare CrossValidation (UCV)
Leastsquare crossvalidation has been developed indepependently in [Rud82] and [Bow84]. In the univariate case, Equation (3) can be expressed as
(13) 
Since the last term of (13) does not depend on , minimizing (13) is equivalent to minimizing
(14) 
The leastsquare crossvalidation constructs an estimator of from the leaveone out estimator :
(15) 
where the leaveone out estimator is given by
(16) 
Then,
is unbiasedly estimated by
, which justifies the use of for the bandwidth selection and this is the reason why this estimator is also called unbiased crossvalidation (UCV) estimator. Using the expression of in (15) and replacing the factor with for computation ease, the following estimator is used in practice:(17) 
where , with the symbol ’’ used for the convolution product and . Finally, the bandwidth selected by the leastsquare crossvalidation is given by:
(18) 
2.1.4 Biased CrossValidation (BCV)
The biased crossvalidation was developed in [ST87]. It consists in minimizing the AMISE. So, we assume that exists and . Since the AMISE depends on the unknown density through , the biased crossvalidation estimates by
(19) 
Straightforward computations show that
which is close to when is small under mild conditions on . This justifies the use of the objective function of BCV defined by:
(20) 
Finally, the bandwidth selected by the BCV is given by:
(21)  
(22) 
2.1.5 Sheather and Jones Plugin (SJ)
This estimator is, as BCV, based on the minimization of the AMISE. The difference with the BCV approach is in the estimation of . In this plugin approach, is estimated by the empirical mean of the fourth derivative of , where is replaced by a pilot kernel density estimate of . Indeed, using two integrations by parts, under mild assumptions on , we have The pilot kernel density estimate is defined by:
(23) 
where is the pilot kernel function and the pilot bandwidth. Then, is estimated by with
(24) 
The pilot bandwidth is chosen in order to compensate the bias introduced by the diagonal term in (24) as explained in Section 3 of [SJ91]. Thus, for choosing , Sheather and Jones propose two algorithms based on the remark that, in this context, the pilot bandwidth can be written as a quantity proportional to or proportional to . The first algorithm, called ’solve the equation’ (’ste’), consists in taking the expression , pluging in (6) and solving the equation. The second algorithm, ’direct plugin’, consists in taking , and pluging in (6). Thus the SJ estimators of are given by:
(25) 
for the ’ste’ algorithm and
(26) 
for the ’dpi’ algorithm. The constant is where and are estimated by and with and the Silverman’s rule of thumb bandwidths respectively. The constant is equal to (see Equation (9) of [SJ91]), where is estimated by with the Silverman’s rule of thumb bandwidth.
2.2 Multivariate case
The difficulty of the multivariate case lies in the selection of a matrix rather than a scalar bandwidth. Different classes of matrices can be used. The simplest class corresponds to matrices of the form for . In this case, selecting the bandwidth matrix is equivalent to deriving a single smoothing parameter. However, the unknown distribution may have different behaviors according to the coordinate direction. The latter parametrization does not allow to take this specificity into account. An extension of this class corresponds to diagonal matrices of the form . But this parametrization is not convenient when the directions of the density are not those of the coordinates. The most general case corresponds to the class of all symmetric definite positive matrices, which allows smoothing in arbitrary directions. A comparison of these parametrizations can be found in [WJ93]. In this paper, we focus on diagonal and on symmetric definite positive matrices parametrization.
We now assume that the kernel satisfies
where is a finite positive constant. In the general setting of symmetric definite positive matrices, and using the asymptotic expansion of the bias and the variance terms, the MISE is usually approximated by the AMISE function defined by
with the Hessian matrix of . See [Wan92] for instance. Note that can also be expressed as
(27) 
where is the vector half operator which transforms the lower triangular half of a matrix into a vector scanning columnwise and the matrix is defined by
(28) 
with the Hessian matrix of and the diagonal matrix formed with the diagonal elements of A.
2.2.1 Penalized Comparison to Overfitting (PCO)
The PCO methodology developed in [LMR17] only deals with diagonal bandwidths . We now generalize it to the more general case of symmetric positivedefinite matrices to compare its numerical performances to all popular methods dealing with multivariate densities. We then establish theoretical properties of PCO in oracle and minimax approaches. To the best of our knowledge, similar results have not been established for competitors of PCO.
In the sequel, we consider , a finite set of symmetric positivedefinite matrices. Let
be a positive lower bound of all eigenvalues of any matrix of
. Then, set . We still considerwith
and Define
which goes to when goes to , under mild assumptions. The estimator verifies the following oracle inequality.
Theorem 2.1.
Assume that and is symmetric. Assume that . Let and . If then, with probability larger than ,
where is an absolute constant and if , if . The constant only depends on and and only depends on , and .
The proof of Theorem 2.1 is given in Section A. Up to the constant , the first term of the oracle inequality corresponds to the ISE of the best estimate when describes . The main assumption of the theorem means that cannot be smaller than up to a constant. When is taken proportional to , then the third term is of order and is negligible with proportional to . The second term is also negligible when is smooth enough and small (see Corollary 2.3 below).
Remark 2.2.
Note that and So, taking ensures that the leading constant of the main term of the right hand side is close to 1 when is small. Neglecting the other terms, this oracle inequality shows that the risk of PCO tuned with is not worse than the risk of the best estimate up to the constant , for any . Fixing as for the univariate case makes PCO a freetuning methodology.
From Theorem 2.1, we deduce that if is not too big and contains a quasioptimal bandwidth, we can control the MISE of PCO on the Nikol’skii class of functions by assuming that
has enough vanishing moments. The anisotropic Nikol’skii class is a smoothness space in
and it is a specific case of the anisotropic Besov class, often used in nonparametric kernel estimation framework (for a precise definition, see [GL14]). To deduce a rate of convergence, we focus on a parametrization of the form with some given matrix. The practical choice of is discussed in Section 3.4. The following result is a generalization of Corollary 7 of [LMR17]. We set the set of positive integers. We say that a kernel is of order if for any nonconstant polynomial of degree smaller than ,Corollary 2.3.
Let
an orthogonal matrix. Consider
with and choose for the following set of bandwidths:Assume that belongs to the anisotropic Nikol’skii class . Assume that the kernel is order . Then, if is bounded by a constant ,
where is a constant only depending on , and and
2.2.2 Rule of thumb (RoT)
For a general parametrization, in [Wan92], the authors derive the formula for the AMISE expansion of the MISE and also look at the particular case of the multivariate normal density with a gaussian kernel. More precisely, the AMISE expansion given by Equation (27) depends on through . The easiest way to minimize the AMISE is to take, for , the multivariate Gaussian density with mean and covariance matrix in the expression of (see (28)), combined with , the standard Gaussian kernel, in the AMISE expression (see (27)). Then, the AMISEoptimal bandwidth matrix is
(29) 
where is the empirical covariance matrix of the data [Wan92].
2.2.3 LeastSquare CrossValidation (UCV)
The multivariate generalisation of the leastsquare crossvalidation was developed in [Sto84]. It can easily be observed that computations leading to the CrossValidation criterion for univariate densities can be extended without any difficulty to the case of multivariate densities and we set
(30) 
with
(31) 
where , still by denoting ’’ the convolution product and .
2.2.4 Smoothed CrossValidation (SCV)
The Smoothed CrossValidation (SCV) approach proposed by [DH05] is based on the improvement of the AMISE approximation of the MISE by replacing the first term of (27) with the exact integrated squared bias. Then, crossvalidation is used to estimate the bias term. Therefore, the objective function for the multivariate SCV methodology is
(32) 
where is the pilot kernel and the pilot bandwidth matrix and the selected bandwidth is then
(33) 
See Section 3 of [DH05] or Sections 2 and 3 of [CD11] for more details. To design the pilot bandwidth matrix, Duong and Hazelton [DH05] restrict to the case for , whereas Chacón and Duong [CD11] consider full matrices.
2.2.5 Plugin (PI)
In the same spirit as the onedimensional SJ estimator described in Section 2.1.5, the goal of the multivariate plugin estimator is to minimize which depends on the unknown matrix whose elements are given by the ’s for all such that and defined by
(34) 
In [WJ94b], the elements are estimated by
(35) 
where, as usual, is a pilot kernel and a pilot bandwidth matrix. Some limitations of this approach are emphasized in [WJ94b]. This is the reason why [CD10] alternatively suggests to estimate by using
(36) 
with the th Kronecker product. Section 4.1 of [CD10] describes the choice of and finally the selected bandwidth is given by
(37) 
where
3 Numerical study
To study the numerical performances of the PCO approach, a large set of testing distributions has been considered. For the univariate case, we use the benchmark densities proposed by [MW92] whose list is slightly extended. See Figure 23 in Appendix and Table 2 for the specific definition of 19 univariate densities considered in this paper. For multivariate data, we start from the 12 benchmark densities proposed by [Cha09] and PCO is tested on an extended list of 14 densities (see Table 3 and Figure 24). Their definition is generalized to the case of dimensions 3 and 4 (see Tables 4 and 5 respectively). We provide 3dimensional representations of the testing densities in Figure 39.
3.1 PCO implementation and complexity
This section is devoted to implementation aspects of PCO and we observe that its computational cost is very competitive with respect to competitors considered in this paper. We first deal with the univariate case for which three kernels have been tested, namely the Gaussian, the Epanechnikov and the biweight kernels, respectively defined by:
For any kernel , If is the Gaussian kernel, , and the penalty term defined in (9) can be easily expressed:
For the Epanechnikov kernel, we have and
With a biweight kernel, since , the penalty term becomes
Moreover, the loss can be expressed as
With a Gaussian kernel, this formula has a simpler expression: