
lpdensity: Local Polynomial Density Estimation and Inference

06/15/2019
by   Matias D. Cattaneo, et al.

Density estimation and inference methods are widely used in empirical work. When the data has compact support, as all empirical applications de facto do, conventional kernel-based density estimators are inapplicable near or at the boundary because of their well known boundary bias. Alternative smoothing methods are available to handle boundary points in density estimation, but they all require additional tuning parameter choices or other typically ad hoc modifications depending on the evaluation point and/or approach considered. This article discusses the R and Stata package lpdensity implementing a novel local polynomial density estimator proposed in Cattaneo, Jansson and Ma (2019), which is boundary adaptive, fully data-driven and automatic, and requires only the choice of one tuning parameter. The methods implemented also cover local polynomial estimation of the cumulative distribution function and density derivatives, as well as several other theoretical and methodological results. In addition to point estimation and graphical procedures, the package offers consistent variance estimators, mean squared error optimal bandwidth selection, and robust bias-corrected inference. A comparison with several other density estimation packages and functions available in R using a Monte Carlo experiment is provided.


1 Introduction

Nonparametric estimation of a probability density function (PDF), as well as its underlying cumulative distribution function (CDF) or higher-order derivatives thereof, plays an important role in empirical work across many disciplines. Sometimes these quantities are the main object of interest, while in other cases they are useful ingredients in forming more complex nonparametric or semiparametric statistical procedures. For textbook introductions to kernel-based density and local polynomial methods see Wand and Jones (1995) and Fan and Gijbels (1996).

This article discusses the main methodological and numerical features of the software package lpdensity, available in both R and Stata, which implements the novel local polynomial smoothing approach proposed in Cattaneo, Jansson and Ma (2019) for estimation of and inference on a smooth CDF, PDF, and derivatives thereof. In a nutshell, the idea underlying this approach is to first approximate the discontinuous empirical distribution function using a smooth local polynomial approximation, which then is employed to construct a smooth local polynomial estimator of the distribution function, density function, and associated higher-order derivatives.

The resulting local polynomial density estimator is intuitive and easy to implement, and exhibits several interesting theoretical and practical features. For example, it does not require pre-binning or any other complicated pre-processing of the data, and enjoys all the celebrated features associated with local polynomial regression estimation (Fan and Gijbels, 1996). In particular, it automatically adapts to the (possibly unknown) boundaries of the density's support, a feature that is unavailable for most other density estimators in the literature; see Karunamuni and Albert (2005) for a review on this topic. Two exceptions are the local polynomial density estimators of Cheng, Fan and Marron (1997) and Zhang and Karunamuni (1998), which require pre-binning of the data or, more generally, pre-estimation of the density near the boundary, thereby introducing additional tuning parameters that need to be chosen for implementation. In contrast, the density estimator implemented in the package lpdensity requires choosing only one tuning parameter: the bandwidth entering the local polynomial approximation, for which the package also includes fully data-driven mean squared error (MSE) optimal bandwidth selection methods. Furthermore, following the results in Calonico, Cattaneo and Farrell (2018), robust bias-corrected inference methods are also implemented, allowing for the use of the MSE-optimal bandwidth choice when forming confidence intervals or conducting hypothesis tests in applications.

The software implementation covers smooth estimation of the distribution and density function, and derivatives thereof, for any polynomial order at both interior and boundary points, offering an automatic boundary-adaptive smooth kernel-based density estimator. Cattaneo, Jansson and Ma (2019) give formal large-sample statistical results for these estimators, including (i) asymptotic expansions of the leading bias and variance, (ii) asymptotic Gaussian distributional approximations, (iii) consistent standard error estimators, and (iv) consistent data-driven bandwidth selection based on asymptotic MSE expansions of the point estimators. Importantly, all these results apply to both interior and boundary points in a fully automatic, data-driven way. We briefly summarize these results in the upcoming sections, and illustrate them numerically, including a comparison with other methods available in R.

In the remainder of this article, we focus on the R implementation of the software package lpdensity. We note, however, that all functionalities are also available in Stata, and we provide replication code for both statistical software platforms. The package includes the following three functions/commands.

  • lpdensity(). This function/command implements the local polynomial approximation to the empirical distribution function for a grid of evaluation points, and offers smooth point estimators of the CDF, PDF, and derivatives thereof. The function takes the bandwidth for each evaluation point as given; if a bandwidth is not provided, it employs the companion function lpbwdensity() described below for fully data-driven bandwidth selection. Variance estimation is fully automatic, and inference is implemented in two ways: (i) standard inference methods assuming undersmoothing or no smoothing bias, and (ii) robust bias correction. Because all the methods are automatic and boundary adaptive, the grid of evaluation points can span the full support of the data.

  • lpbwdensity(). This function/command offers pointwise and integrated MSE-optimal data-driven bandwidth selectors for the local polynomial CDF, PDF, and higher-order derivative estimators implemented in lpdensity(). The selectors are boundary adaptive (in rates) for all evaluation points, even when the boundary location is unknown, and consistent under an additional condition on the local polynomial fit (discussed below). Both rule-of-thumb (ROT) and direct plug-in (DPI) implementations are available.

  • lpdensity.plot(). This function/command can be used to plot the estimated CDF, PDF, or higher order derivatives thereof for graphical illustration. The function takes the output from lpdensity(), and plots both the point estimates (on a collection of grid points) and the (pointwise) robust bias-corrected confidence intervals. This function builds on ggplot2 in R.

There are several other packages and functions available for kernel-based density estimation in R. Table 1 gives a summary of their functionalities. As shown in that table, the package lpdensity is the first to offer valid bandwidth selection, estimation and inference for both interior and boundary points, covering the CDF, PDF and derivatives thereof. Section 4 compares the numerical performance of these packages in a simulation study.

This article continues as follows. Section 2 offers a brief, self-contained overview of the main ideas underlying point estimation, bandwidth selection, and inference for the kernel-based local polynomial density methods implemented, including other related methodological features and details. Section 3 illustrates the main features of the package lpdensity with both simulated data and an empirical application in program evaluation, and Section 4 showcases its finite-sample performance and compares it with other R packages implementing kernel-based density estimators. Section 5 concludes. Installation details, scripts replicating the numerical results reported herein, links to software repositories, and other companion information can be found on the package's website: https://sites.google.com/site/lpdensity/.

Package (Function)              Last Update
GenKern (KernSec)               2013-11-11
KernSmooth (bkde, locpoly)      2015-06-29
ks (kdde, kde)                  2019-02-20
np (npudens)                    2018-10-25
nprobust (kdrobust)             2019-01-10
plugdensity (plugin.density)    2011-11-30
sm (sm.density)                 2018-09-27
stats (density)                 shipped with base R
lpdensity (lpdensity)           2019-04-15

Notes: For each function, Table 1 reports whether the following features are available: bandwidth selection, density derivative estimation, valid estimation at boundary points, standard errors, and valid inference. For some packages, inference is available and valid only if undersmoothing is used, but undersmoothing is not the default in those packages (and hence inference is invalid by default).

Table 1: Comparison of R Packages

2 Methodology and Implementation

This section offers a brief overview of the main methods implemented in the R and Stata package lpdensity. For formal results, including assumptions, proofs and any other technical details see Cattaneo, Jansson and Ma (2019, hereafter CJM). Here we focus on the main ideas and implementation details instead.

We assume that $x_1, x_2, \ldots, x_n$ is a random sample from the random variable $x$, where $F(x)$ denotes its smooth cumulative distribution function, $f(x)$ denotes its smooth probability density function, and $\mathcal{X}$ denotes its support, which can be bounded or unbounded. As is well known, conventional kernel density estimators will be biased at or near boundary points, and other density estimators must be used if the goal is to estimate a density function on its full (compact, say) support. The package lpdensity implements a simple, easy-to-interpret and boundary adaptive density estimator based on local polynomial methods. As a by-product, the package also offers a smooth local polynomial estimate of the CDF as well as higher-order density derivatives. To cover all cases in a unified way, we employ the conventional notation $g^{(v)}(x) = d^v g(x)/dx^v$ for any smooth function $g(x)$, and define $F^{(0)}(x) = F(x)$, $F^{(1)}(x) = f(x)$, and the derivatives of the density function as $F^{(v)}(x) = f^{(v-1)}(x)$ for $v \ge 2$. Therefore, in the sequel the parameter $v$ refers to the $v$-th derivative of the CDF (i.e., the $(v-1)$-th derivative of the PDF).

2.1 Local Polynomial Distribution and Density Estimation

To describe the estimators implemented in the package lpdensity, consider first the weighted empirical distribution function

$$\hat F(x) = \frac{1}{n}\sum_{i=1}^{n} w_i\,\mathbb{1}(x_i \le x),$$

where $\mathbb{1}(\cdot)$ denotes the indicator function, and the weights $w_1, \ldots, w_n$ are introduced for empirical applications such as missing data or counterfactual calculations. We assume these weights are normalized so that $\frac{1}{n}\sum_{i=1}^{n} w_i = 1$. Furthermore, the package lpdensity allows for a general (possibly estimated) weighting scheme embedded in $\hat F(x)$, although we will assume throughout this article that $w_i = 1$ for simplicity, that is, $\hat F(x)$ is the standard root-$n$ consistent empirical distribution function estimator of $F(x)$.
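For reference, the following minimal R sketch (our own illustration, not the package's internal code; the name Fhat_w is hypothetical) computes this weighted empirical distribution function for a generic weight vector.

# Weighted empirical distribution function: (1/n) * sum_i w_i * 1(x_i <= x0),
# with weights normalized so that they average to one (unit weights by default)
Fhat_w <- function(x0, data, w = rep(1, length(data))) {
  w <- w / mean(w)                       # enforce (1/n) * sum_i w_i = 1
  sapply(x0, function(x) mean(w * (data <= x)))
}

# Example: evaluate at a few points for simulated standard normal data
set.seed(1)
z <- rnorm(500)
Fhat_w(c(-1, 0, 1), z)                   # close to pnorm(c(-1, 0, 1))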

As an alternative to conventional kernel density estimators, consider an estimator that first smooths out $\hat F(x)$ using local polynomial techniques, and then constructs an estimator of $F(x)$ and its derivatives using this local polynomial approximation. Specifically, for $x \in \mathcal{X}$, the estimator implemented in the package lpdensity is

$$\hat\beta(x) = \arg\min_{b \in \mathbb{R}^{p+1}} \sum_{i=1}^{n} \left[ \hat F(x_i) - r_p(x_i - x)'b \right]^2 K\!\left(\frac{x_i - x}{h}\right), \qquad (1)$$

where $r_p(u) = (1, u, u^2, \ldots, u^p)'$ is the $p$-th order polynomial expansion, $K(\cdot)$ is a kernel function such as the uniform or triangular kernel, and $h$ is a positive vanishing bandwidth sequence. In words, the estimator approximates the discontinuous empirical distribution function by a smooth local polynomial expansion using the weighting scheme implied by the kernel function, localized around the evaluation point $x$ according to the bandwidth $h$. CJM showed that

$$\hat\beta(x) \;\to_{\mathbb{P}}\; \left( F(x),\; F^{(1)}(x),\; \frac{F^{(2)}(x)}{2!},\; \ldots,\; \frac{F^{(p)}(x)}{p!} \right)',$$

where $\to_{\mathbb{P}}$ denotes convergence in probability as $n \to \infty$ and $h \to 0$. This implies that the local weighted least squares coefficients are consistent estimators of the CDF, PDF, and derivatives thereof, at the evaluation point $x$, up to a known multiplicative factor.

Therefore, the generic local polynomial distribution estimator takes the form

$$\hat F^{(v)}(x) = v!\, e_v' \hat\beta(x), \qquad v = 0, 1, \ldots, p,$$

where $e_v$ denotes the conformable unit vector selecting the $v$-th coefficient of $\hat\beta(x)$. This estimator is implemented in the function/command lpdensity(), given a choice of evaluation point $x$, polynomial degree $p$, derivative order $v$, kernel function $K(\cdot)$, and bandwidth parameter $h$. In particular, the local polynomial density estimator is $\hat f(x) = \hat F^{(1)}(x)$, which is implemented via the default lpdensity(...,p=2,v=1), that is, using a quadratic approximation to the empirical CDF leading to the density estimator $\hat f(x)$. Similarly, for any $v \ge 2$, $\hat F^{(v)}(x)$ also gives a smoothed-out estimator of higher-order derivatives of the CDF and PDF functions, which are useful, for example, for bandwidth selection in nonparametrics.

While the smooth CDF estimator $\hat F^{(0)}(x)$ is useful, the density and higher-order derivatives estimator $\hat F^{(v)}(x)$, $v \ge 1$, is perhaps more relevant for empirical work because of its interesting features. In particular, as mentioned before, the PDF estimator $\hat f(x)$ is intuitive and very easy to implement, while also being boundary adaptive. Thus, the estimator can be computed for all evaluation points on the support of the data in an automatic and straightforward way. This explains why the main functions in the package lpdensity refer to density estimation, despite smooth local polynomial CDF estimation also being available. Furthermore, higher-order density derivative estimation is also readily available and boundary adaptive. For example, the first derivative of the PDF estimated using a local quartic approximation to the empirical CDF is implemented by lpdensity(...,p=4,v=2).

The rest of this section outlines the main properties, both statistical and numerical, of the estimator $\hat F^{(v)}(x)$, and describes other related issues such as bandwidth selection, variance estimation, and valid (robust bias-corrected) inference.

2.2 Mean Squared Error

The estimator $\hat\beta(x)$ can be written in the familiar weighted least squares form

$$\hat\beta(x) = \left(R'WR\right)^{-1} R'W \hat Y, \qquad R = \left[r_p(x_1 - x), \ldots, r_p(x_n - x)\right]', \qquad W = \mathrm{diag}\!\left(K\!\left(\tfrac{x_i - x}{h}\right) : i = 1, \ldots, n\right),$$

with $R$ the usual polynomial design matrix and $W$ the usual diagonal matrix of kernel weights. The only difference relative to the standard local polynomial regression literature is that now the “outcome” (or left-hand-side variable) is estimated, that is, $\hat Y = (\hat F(x_1), \ldots, \hat F(x_n))'$.
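To make the weighted least squares construction concrete, here is a bare-bones R sketch of the estimator in equation (1), assuming unit weights and a triangular kernel; the function name lpdens_sketch is ours, the sketch is only an illustration of the formula above, and the package lpdensity should be used in practice.

# Local polynomial density estimator at a single evaluation point x:
# regress the empirical CDF on r_p(x_i - x) with kernel weights, then
# read off the v-th coefficient times v! (v = 1 gives the density)
lpdens_sketch <- function(x, data, h, p = 2, v = 1) {
  Fhat <- ecdf(data)(data)                  # empirical CDF at the sample points
  u    <- (data - x) / h
  k    <- pmax(0, 1 - abs(u))               # triangular kernel weights
  R    <- outer(data - x, 0:p, "^")         # polynomial design matrix r_p(x_i - x)
  beta <- solve(t(R) %*% (k * R), t(R) %*% (k * Fhat))   # weighted least squares
  factorial(v) * beta[v + 1]
}

# Example usage: density of a standard normal sample at x = 0
set.seed(42)
z <- rnorm(1000)
lpdens_sketch(0, z, h = 0.5)                # roughly dnorm(0) = 0.399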

Unlike other local polynomial density estimators proposed in the literature (e.g., Cheng, Fan and Marron, 1997; Zhang and Karunamuni, 1998), the estimator $\hat F^{(v)}(x)$ does not require pre-binning of the data or any other pre-processing of the observations beyond constructing the empirical distribution function, a procedure that is fully automatic and does not rely on additional tuning parameter choices (e.g., this estimation approach removes the need to choose the number, position and length of the bins in a preliminary histogram estimate). Furthermore, the CDF estimator $\hat F(x)$ is root-$n$ consistent, and its statistical properties are well understood.

CJM obtained a general stochastic approximation to the bias and variance of the estimator $\hat F^{(v)}(x)$, $0 \le v \le p$, for all evaluation points $x \in \mathcal{X}$. Here we discuss the leading case of density estimation and derivatives thereof. For any choice of polynomial order $p$ and derivative order $1 \le v \le p$, the variance and bias approximations for $\hat F^{(v)}(x)$ can be expressed, respectively, as follows:

$$\mathsf{Var}\big[\hat F^{(v)}(x)\big] \approx \frac{1}{n h^{2v-1}}\,\hat{\mathcal{V}}(x), \qquad \mathsf{Bias}\big[\hat F^{(v)}(x)\big] \approx h^{p+1-v}\,F^{(p+1)}(x)\,\hat{\mathcal{B}}_1(x) + h^{p+2-v}\,F^{(p+2)}(x)\,\hat{\mathcal{B}}_2(x),$$

where $\hat{\mathcal{V}}(x)$, $\hat{\mathcal{B}}_1(x)$ and $\hat{\mathcal{B}}_2(x)$ denote pre-asymptotic quantities that can be implemented directly using only the data, the choice of (preliminary) bandwidth, evaluation point, polynomial order, derivative order, and kernel function. Furthermore, it can be shown that $\hat{\mathcal{V}}(x)$, $\hat{\mathcal{B}}_1(x)$ and $\hat{\mathcal{B}}_2(x)$ converge (in probability) to well-defined non-random limiting objects. In particular, $\hat{\mathcal{V}}(x)$ converges to $f(x)$ times a quantity depending only on the estimator features such as the evaluation point, kernel, polynomial order, etc. Similarly, the limits of $\hat{\mathcal{B}}_1(x)$ and $\hat{\mathcal{B}}_2(x)$ depend on the evaluation point, the kernel function, and the choices of $p$ and $v$. All these quantities are obtained in pre-asymptotic form, which has been shown to offer higher-order improvements in the context of local polynomial regression (Calonico, Cattaneo and Farrell, 2018).

In pre-asymptotic form, the above approximations are valid in all cases and for all evaluation points, so we can define a generic pointwise MSE-optimal bandwidth choice:

$$h_{\mathsf{MSE}}(x) = \arg\min_{h > 0}\left\{ \mathsf{Bias}\big[\hat F^{(v)}(x)\big]^2 + \mathsf{Var}\big[\hat F^{(v)}(x)\big] \right\}.$$

Note that the optimal bandwidth also depends on $p$ and $v$, which we suppress to conserve notation. It follows that, under standard regularity conditions, $h_{\mathsf{MSE}}(x)$ is MSE-optimal in rates for all evaluation points and choices of $p$ and $v$, and MSE-optimal in constants and rates when either (i) $p - v$ is odd or (ii) $x$ is a boundary point. See Appendix A for more details.

We also define the integrated MSE (IMSE) optimal bandwidth choice as follows:

$$h_{\mathsf{IMSE}} = \arg\min_{h > 0} \int_{\mathcal{X}} \left\{ \mathsf{Bias}\big[\hat F^{(v)}(x)\big]^2 + \mathsf{Var}\big[\hat F^{(v)}(x)\big] \right\} \omega(x)\,dx,$$

where $\omega(x)$ denotes a user-chosen weighting scheme and dependence on $p$ and $v$ is again suppressed to ease notation. In practice, the integral is approximated over the grid points specified in lpdensity() or lpbwdensity(), allowing for both uniform weighting ($\omega(x) = 1$) as well as empirical distribution weighting.

The infeasible MSE-optimal and IMSE-optimal bandwidth selectors, $h_{\mathsf{MSE}}(x)$ and $h_{\mathsf{IMSE}}$, are carefully developed so that they automatically adapt to boundary points, while also retaining their other main theoretical features (e.g., rate optimality). In practice, these bandwidths can be computed after replacing the unknown quantities by plug-in estimators thereof using preliminary bandwidth choices. We discuss implementation details of the different bandwidth choices available in the R and Stata package lpdensity further below.

2.3 Point Estimation and Robust Bias-Corrected Inference

Both infeasible bandwidth choices, $h_{\mathsf{MSE}}(x)$ and $h_{\mathsf{IMSE}}$, and their feasible counterparts, denoted by $\hat h_{\mathsf{MSE}}(x)$ and $\hat h_{\mathsf{IMSE}}$ and discussed below, can be used to construct an MSE-optimal or IMSE-optimal point estimator for the PDF or its derivatives, respectively. The package lpdensity also allows for CDF estimation and computes an (I)MSE-optimal bandwidth selector for that case, though we do not provide the details here to conserve space: because the CDF estimator is root-$n$ consistent, some details and stochastic approximations change in non-trivial ways; see CJM for more details.

As is well known in the nonparametric literature, standard Wald-type inference is not valid when the (I)MSE-optimal bandwidth is used to construct the nonparametric point estimator. To be specific, the following distributional approximation holds for the standard Wald-type test statistic based on the local polynomial density estimator constructed using an (I)MSE-optimal or even a cross-validation-based bandwidth choice:

$$T(x) = \frac{\hat F^{(v)}(x) - F^{(v)}(x)}{\sqrt{\widehat{\mathsf{Var}}\big[\hat F^{(v)}(x)\big]}} \;\rightsquigarrow\; \mathsf{N}\big(\mathsf{B}(x),\, 1\big),$$

and the bias term $\mathsf{B}(x)$, which reflects the (scaled) smoothing bias of the point estimator, cannot be dropped in general. In other words, if the point estimator is constructed using the (I)MSE-optimal bandwidth, inference based on the usual confidence interval centered at the point estimator and scaled by its standard error is invalid, due to the presence of asymptotic bias. Here and in the sequel $z_{1-\alpha/2}$ denotes the $(1-\alpha/2)$ quantile of the standard normal distribution.

Most statistical software packages currently available disregard this issue (see Table 1 for details). To be very clear, and considering the case of the density function for example, many standard software packages in R and Stata report the standard $100(1-\alpha)$-percent nominal confidence interval $\big[\,\hat f(x) \pm z_{1-\alpha/2}\sqrt{\widehat{\mathsf{Var}}[\hat f(x)]}\,\big]$, where the estimator $\hat f(x)$ is constructed using an MSE, IMSE or a cross-validation-type bandwidth, even though in all these cases the resulting confidence interval does not have $100(1-\alpha)$-percent coverage, even in large samples. The mechanical solution to this inferential problem, often encountered in textbooks, is to undersmooth the estimator $\hat f(x)$, that is, to construct the test statistic using an ad hoc bandwidth smaller than the (I)MSE-optimal one used for the point estimator. The function/command lpdensity() allows for this inference approach, of course, by simply running estimation and inference in two separate steps (see Section 3 for an illustration).

Calonico, Cattaneo and Farrell (2018) recently showed that, under the same assumptions employed to construct an (I)MSE-optimal bandwidth choice, it is suboptimal to employ undersmoothing for inference in nonparametrics. Instead, it is demonstrably better, in terms of higher-order distributional and coverage/inference properties, to employ robust bias correction (RBC): the idea is to bias correct the point estimator and then adjust the variance accordingly. Heuristically, and abusing notation for simplicity, this leads to the Wald-type test statistic

$$T_{\mathsf{RBC}}(x) = \frac{\big(\hat F^{(v)}(x) - \widehat{\mathsf{Bias}}\big[\hat F^{(v)}(x)\big]\big) - F^{(v)}(x)}{\sqrt{\widehat{\mathsf{Var}}_{\mathsf{RBC}}\big[\hat F^{(v)}(x)\big]}},$$

which has a valid standard normal limiting distribution even when an MSE, IMSE or a cross-validation-type bandwidth is used. Confidence intervals with correct asymptotic size can be constructed by inverting the test statistic $T_{\mathsf{RBC}}(x)$. In particular, it can be shown that a robust bias-corrected confidence interval is equivalent to employing the test statistic based on a higher-order local polynomial point estimator, together with its associated variance estimator, for a particular choice of parameters/implementation. Therefore, for inference the function/command lpdensity() employs by default an RBC test statistic: assuming an (I)MSE-optimal or cross-validation-based bandwidth for the $p$-th order point estimator is used, denoted generically by $\hat h_p$, the package forms the test statistic

$$T_{\mathsf{RBC}}(x) = \frac{\hat F^{(v)}_q(x; \hat h_p) - F^{(v)}(x)}{\sqrt{\widehat{\mathsf{Var}}\big[\hat F^{(v)}_q(x; \hat h_p)\big]}}, \qquad q > p,$$

and associated confidence intervals

$$\mathsf{CI}_{\mathsf{RBC}}(x) = \left[\, \hat F^{(v)}_q(x; \hat h_p) \;\pm\; z_{1-\alpha/2}\,\sqrt{\widehat{\mathsf{Var}}\big[\hat F^{(v)}_q(x; \hat h_p)\big]}\,\right],$$

where now the notation makes the bandwidth and polynomial order explicit to distinguish the two polynomial degrees used in constructing the point estimators and the RBC confidence interval/test statistic: (i) a $p$-th order polynomial is used for point estimation (and bandwidth selection), and (ii) a $q$-th order polynomial, with $q = p + 1$ by default, is used for inference. In general, it is possible to use even higher-order polynomials to construct robust bias-corrected confidence intervals.

More generally, the package lpdensity implements confidence intervals of the form

$$\mathsf{CI}(x) = \left[\, \hat F^{(v)}_q(x; \hat h) \;\pm\; z_{1-\alpha/2}\,\sqrt{\widehat{\mathsf{Var}}\big[\hat F^{(v)}_q(x; \hat h)\big]}\,\right],$$

with the choice of $q$ determining the inference approach. The default option $q = p + 1$ corresponds to robust bias correction, while setting $q = p$ gives the conventional (undersmoothed) confidence interval, which is invalid whenever an MSE-optimal-type bandwidth is used.
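For concreteness, a minimal sketch of these two inference modes using the lpdensity() interface illustrated in Section 3; here x stands for a generic data vector and h for a bandwidth choice, both placeholders.

# Robust bias-corrected inference (the default): q = p + 1 = 3
est_rbc <- lpdensity(x, bw = h, p = 2)
# Conventional (undersmoothed) confidence intervals: set q equal to p
est_conv <- lpdensity(x, bw = h, p = 2, q = 2)
summary(est_rbc)
summary(est_conv)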

2.4 Bandwidth Estimation

The package lpdensity implements several bandwidth selectors through the function/command lpbwdensity(), including MSE-optimal and IMSE-optimal plug-in rules, and rule-of-thumb bandwidth selectors based on a Gaussian reference model. Here we only outline the main aspects of bandwidth selection; further details are given in the appendix.

To introduce our bandwidth selectors, recall that the quantities $\hat{\mathcal{V}}(x)$, $\hat{\mathcal{B}}_1(x)$ and $\hat{\mathcal{B}}_2(x)$ are given in pre-asymptotic form, and hence they can be computed from the data directly, given a pilot/preliminary bandwidth. As a consequence, to construct the (I)MSE-optimal bandwidth estimators, the only unknown quantities are the higher-order derivatives $F^{(p+1)}(x)$ and $F^{(p+2)}(x)$, which can be consistently estimated using the local polynomial density derivative estimators implemented in lpdensity() with a pilot/preliminary bandwidth. To be precise, the MSE-optimal bandwidth is estimated by

$$\hat h_{\mathsf{MSE}}(x) = \arg\min_{h > 0}\left\{ \widehat{\mathsf{Bias}}\big[\hat F^{(v)}(x)\big]^2 + \widehat{\mathsf{Var}}\big[\hat F^{(v)}(x)\big] \right\},$$

with $\widehat{\mathsf{Bias}}[\hat F^{(v)}(x)]$ constructed by replacing $F^{(p+1)}(x)$ and $F^{(p+2)}(x)$ with their estimated counterparts using a pilot bandwidth. Similarly, the IMSE-optimal bandwidth selector is given by

$$\hat h_{\mathsf{IMSE}} = \arg\min_{h > 0} \sum_{k=1}^{K}\left\{ \widehat{\mathsf{Bias}}\big[\hat F^{(v)}(x_k)\big]^2 + \widehat{\mathsf{Var}}\big[\hat F^{(v)}(x_k)\big] \right\},$$

where $\{x_1, \ldots, x_K\}$ is the collection of grid points specified in the function/command (by default, twenty quantile-spaced values over the support of the data). We provide more details on the bandwidth selectors, including the rule-of-thumb bandwidths, in Appendix A.

2.5 CDF Estimation

While the estimator $\hat F^{(v)}(x)$ is valid for all $0 \le v \le p$, all the results above focused on the case $v \ge 1$ because the resulting estimators of the density ($v = 1$) and its derivatives ($v \ge 2$) are the main focus of the package. Nevertheless, as a by-product, CJM develop analogous asymptotic estimation, inference and bandwidth selection results for the smoothed local polynomial CDF estimator $\hat F^{(0)}(x)$. These results are also implemented in the package lpdensity via the option v=0. For example, CDF estimation using a kernel-based local constant approximation is implemented by lpdensity(...,p=0,v=0), which by default estimates the corresponding MSE-optimal bandwidth (bwselect="mse-dpi") and employs a local linear approximation for inference (q=p+1=1). For further methodological, technical and implementation details see CJM's supplemental appendix.
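For instance, a smooth CDF estimate over a grid of sample quantiles can be obtained as in the following sketch, where x again stands for a generic data vector and the grid choice is ours.

# Smooth CDF estimation: local constant approximation (p = 0, v = 0), with
# data-driven plug-in bandwidths and local linear inference (q = 1) by default
cdf_est <- lpdensity(x, grid = quantile(x, seq(0.1, 0.9, 0.1)), p = 0, v = 0)
summary(cdf_est)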

3 Implementation and Numerical Illustration

In this section we showcase the numerical performance of the package lpdensity. The data consists of 2,000 observations drawn from the density $f(x) = \phi(x; -1, 1)\,\mathbb{1}(x < 0) + \phi(x; -0.5, 0.5)\,\mathbb{1}(x \ge 0)$, where $\phi(\cdot\,; \mu, \sigma)$ denotes the normal density with mean $\mu$ and standard deviation $\sigma$. This design creates a discontinuity in the density at 0, which we use to illustrate the performance of our procedure at boundaries. Figure 1 plots a histogram estimate and the true density function; a sketch of one way to simulate such data is given after the figure.

Figure 1: Histogram of Simulated Data. (Solid line: true density.)
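The exact data-generating code is part of the replication files; the following sketch (our own, including the seed) draws a sample with the density above by inverse-CDF sampling from each truncated normal piece.

set.seed(123)
n    <- 2000
# P(X < 0) = pnorm(0, -1, 1): mass of the left (truncated normal) piece
left <- runif(n) < pnorm(0, mean = -1, sd = 1)
u    <- runif(n)
v1   <- ifelse(left,
               # N(-1, 1) truncated above at 0
               qnorm(u * pnorm(0, -1, 1), mean = -1, sd = 1),
               # N(-0.5, 0.5) truncated below at 0
               qnorm(pnorm(0, -0.5, 0.5) + u * (1 - pnorm(0, -0.5, 0.5)),
                     mean = -0.5, sd = 0.5))
data <- data.frame(v1 = v1)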

3.1 Function lpdensity()

The function/command lpdensity() provides both point estimates and robust confidence intervals employing the local polynomial density estimator, given a grid of evaluation points and a bandwidth choice. If these are not provided, by default the function/command chooses twenty quantile-spaced grid points over the support of the data and computes a data-driven MSE-optimal bandwidth at each grid point.

To begin, the following command estimates the density function (v=1) with fixed bandwidth bw=0.5 over the evenly spaced grid of evaluation points $\{-2, -1.5, \ldots, 1\}$, using a local quadratic polynomial approximation (p=2, the default) to the CDF. Robust bias-corrected confidence intervals over the grid are also computed, in this case using a local cubic polynomial approximation (q=3, the default) to the CDF.

# Density estimation on an evenly spaced grid with bandwidth 0.5
> model1 <- lpdensity(data$v1, bw = 0.5, grid = seq(-2, 1, 0.5))
> summary(model1)
Call: lpdensity
Sample size                              (n=)    2000
Polynomial order for point estimation    (p=)    2
Density function estimated               (v=)    1
Polynomial order for confidence interval (q=)    3
Kernel function                                  triangular
Bandwidth selection method                       user provided
=============================================================================
                                     Point      Std.       Robust B.C.
          Grid      B.W.   Eff.n      Est.     Error      [ 95% C.I. ]
=============================================================================
1      -2.0000    0.5000     478    0.2525    0.0129     0.2286 ,  0.3039
2      -1.5000    0.5000     695    0.3454    0.0141     0.3097 ,  0.3934
3      -1.0000    0.5000     779    0.4114    0.0153     0.3868 ,  0.4781
4      -0.5000    0.5000     680    0.3371    0.0139     0.2917 ,  0.3728
5       0.0000    0.5000     566    0.3114    0.0142     0.2778 ,  0.3576
-----------------------------------------------------------------------------
6       0.5000    0.5000     298    0.1118    0.0096     0.0683 ,  0.1227
7       1.0000    0.5000      40    0.0111    0.0030    -0.0012 ,  0.0156
=============================================================================

The first part of the output provides basic information on the options specified in the function. The default estimand is the density function, indicated by "Density function estimated (v=) 1". The second part of the output gives the estimation results, including (i) Grid: the grid points; (ii) B.W.: the bandwidths; (iii) Eff.n: the effective sample size for each grid point; (iv) Point Est.: the point estimates using polynomial order p, with the associated standard errors under Std. Error; and (v) Robust B.C. [95% C.I.]: robust bias-corrected 95% confidence intervals. More advanced users may wish to extract point estimates and standard errors for further statistical analysis. The output is stored in a standard matrix, and can be accessed as follows.

# Access lpdensity() output for further analysis
> model1$Estimate

summary() takes two additional arguments. The first one, alpha, specifies the (one minus) nominal coverage of the confidence interval, with default 0.05. The other argument is sep, which controls the horizontal separator. The default value is 5, meaning a dashed line is drawn every five grid points. This feature can be suppressed by setting it to 0. See the following example, which produces 99% confidence intervals with dashed lines drawn every three grid points.

# 99% CI; separator drawn every three grid points
> summary(model1, alpha = 0.01, sep = 3)

When the argument grid is suppressed, the evaluation points will be twenty quantile-spaced points computed from the data. If only point estimates are to be computed (i.e., without bias-corrected confidence intervals), one can set q=0. For example (output is suppressed):

> summary(lpdensity(data$v1, bw = 0.5)) # grid on quantiles
> summary(lpdensity(data$v1, bw = 0.5, q = 0)) # no bias-corrected CI

It is also possible to suppress the argument bw, and the program will select bandwidth automatically by minimizing (estimated) mean squared error, employing the function/command lpbwdensity(). Other bandwidth selection methods are also available in the aforementioned function/command. We will illustrate our automatic bandwidth selection procedures in an upcoming subsection.

Another important argument in lpdensity() is scale, which scales the point estimates and standard errors. This is particularly useful if only part of the data is used. Recall that the data we use has a density discontinuity at 0; hence one should, in principle, estimate the density separately on the two sides of 0, using only the observations on each side. Simply splitting the data will not give consistent estimates:

> # INconsistent density estimation, due to data-splitting
> lpdensity(data$v1[data$v1 < 0], bw = 0.5, grid = 0)$Estimate[, "f_p"]
[1] 0.2792337
> lpdensity(data$v1[data$v1 > 0], bw = 0.5, grid = 0)$Estimate[, "f_p"]
[1] 2.794478
> dnorm(0, mean = -1, sd = 1) # true density left to 0
[1] 0.2419707
> dnorm(0, mean = -0.5, sd = 0.5) # true density right to 0
[1] 0.4839414

The previous commands give point estimates of 0.279 and 2.794, which are far from the true values 0.242 and 0.484. To obtain consistent estimates, we need to scale the estimates by the proportion of the data used for estimation:

> # Consistent density estimation with data-splitting
> lpdensity(data$v1[data$v1 < 0], bw = 0.5, grid = 0,
+   scale = sum(data$v1 < 0)/2000)$Estimate[, "f_p"]
[1] 0.2370694
> lpdensity(data$v1[data$v1 > 0], bw = 0.5, grid = 0,
+   scale = sum(data$v1 > 0)/2000)$Estimate[, "f_p"]
[1] 0.4219661

3.2 Function lpdensity.plot()

The package lpdensity also includes the plotting function lpdensity.plot(). This function takes output from lpdensity() and produces plots of point estimates and (pointwise) robust bias-corrected confidence intervals over the grid of evaluation points selected. Figure 2 shows how plots can be easily generated. Note that in general the confidence intervals are not centered at the point estimates. As described in Section 2, by default point estimates are constructed using a mean squared error optimal bandwidth, which implies the smoothing bias is non-negligible; hence valid inference is obtained by employing robust bias-corrected confidence intervals, which need not be centered at the point estimates (this would also happen if undersmoothing were used instead). The function lpdensity.plot() allows flexible customization. Figure 3 illustrates some of the features.

# Density plot
> model2 <- lpdensity(data$v1, bw = 0.5, grid = seq(-2, 1, 0.05))
> lpdensity.plot(model2, ylabel = "density")
Figure 2: Density Plot.
# Error bar
> lpdensity.plot(model2, type = "points", CItype = "ebar", CIshade = 0.8,
+   ylabel = "density")
# Change color
> lpdensity.plot(model2, lcol = 2, CIcol = 4, ylabel = "density")
# Dashed confidence interval
> lpdensity.plot(model2, CItype = "line", CIshade = 0.8, ylabel = "density")
# 90% confidence interval
> lpdensity.plot(model2, CItype = "line", CIshade = 0.8, alpha = 0.1,
+   ylabel = "density")
Figure 3: Density Plots with Different Specifications. (a) Error Bar; (b) Color; (c) Dashed Confidence Intervals; (d) Nominal 90% Confidence Intervals.

To close this subsection, we showcase how plots produced by lpdensity.plot() can be further modified or improved. Because lpdensity.plot() produces standard ggplot2 objects, additional features and/or geometric objects can be added. For example, we first estimate the density function separately on the two sides of 0, and then plot the two density estimates in one figure. Finally, we add a histogram in the background. The resulting plot is given in Figure 4.

# Estimation with subsamples on the two sides of 0
> model3 <- lpdensity(data$v1[data$v1 < 0], bw = 0.5, grid = seq(-2,
+ 0, 0.05), scale = sum(data$v1 < 0)/2000)
> model4 <- lpdensity(data$v1[data$v1 > 0], bw = 0.5, grid = seq(0,
+ 1, 0.05), scale = sum(data$v1 > 0)/2000)
# Combine two sets of estimates in the same figure
> plot3_4 <- lpdensity.plot(model3, model4, CIshade = 0.6, ylabel = "density",
+   legendGroups = c("left of 0", "right of 0")) +
+ geom_histogram(data = data, aes(x = v1, y = ..density..),
+   breaks = seq(-2, 1, 0.1), fill = 4, col = "white", alpha = 0.2)
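# Move the histogram layer (added last, in position 5) to the back so it is drawn behind the density estimates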
> plot3_4$layers <- plot3_4$layers[c(5, 1:4)]
> plot3_4
Figure 4: Density Plot with Histogram.

3.3 Function lpbwdensity()

The function lpbwdensity() implements four bandwidth selectors: (i) the MSE-optimal plug-in bandwidth selector, denoted by "mse-dpi"; (ii) the IMSE-optimal plug-in bandwidth selector, denoted by "imse-dpi"; (iii) the rule-of-thumb bandwidth selector based on a Gaussian reference model, denoted by "mse-rot"; and (iv) the integrated rule-of-thumb bandwidth selector, denoted by "imse-rot". In this subsection we illustrate some of the main features of lpbwdensity() with the same simulated data used previously.

By default, the function computes the MSE-optimal bandwidth for estimating the density with a local quadratic regression and the triangular kernel, on twenty quantile-spaced grid points over the support of the data. That is, the default is lpbwdensity(..., p=2, v=1, bwselect="mse-dpi", kernel="triangular"). The output resembles that of lpdensity(): it provides basic information on the data and options specified, as well as a matrix with columns (i) Grid: the grid of evaluation points; (ii) B.W.: the estimated bandwidths; and (iii) Eff.n: the effective sample size at each grid point given the estimated bandwidth. Here is an example with a user-chosen grid of evaluation points.

# Bandwidth selection for an evenly spaced grid
> model1bw <- lpbwdensity(data$v1, grid = seq(-2, 1, 0.5))
> summary(model1bw)
Call: lpbwdensity
Sample size                           (n=)    2000
Polynomial order for point estimation (p=)    2
Density function estimated            (v=)    1
Kernel function                               triangular
Bandwidth selection method                    mse-dpi
================================
          Grid      B.W.   Eff.n
================================
1      -2.0000    1.2823    1222
2      -1.5000    0.6448     891
3      -1.0000    0.5439     837
4      -0.5000    2.4392    1945
5       0.0000    0.4962     562
--------------------------------
6       0.5000    0.7309     420
7       1.0000    0.3567      23
================================

The bandwidths estimated by this function can be used as inputs to lpdensity(), but constructing bandwidths in a separate step is usually unnecessary: one can specify a bandwidth selection method through the option bwselect directly in lpdensity(). For example,

# Density estimation using data-driven MSE-optimal bandwidth
> model5 <- lpdensity(data$v1, grid = seq(-2, 1, 0.5))
> summary(model5)

Estimating bandwidths in a separate step is helpful if the user prefers to modify them prior to estimation, such as for ad hoc undersmoothing. To show this feature, we reproduce Figure 2 using the estimated IMSE-optimal bandwidth, as well as with ad hoc under-smoothing where the IMSE-optimal bandwidth is divided by 2 for estimation. See Figure 5 and the accompanying R code.

# IMSE-optimal bandwidth
> model6bwIMSE <- lpbwdensity(data$v1, grid = seq(-2, 1, 0.05),
+   bwselect = "imse-dpi")
# Estimation with IMSE-optimal bandwidth
> model6 <- lpdensity(data$v1, grid = seq(-2, 1, 0.05),
+   bw = model6bwIMSE$BW[, "bw"])
> lpdensity.plot(model6, ylabel = "density")
# Estimation with ad hoc under-smoothing
> model7 <- lpdensity(data$v1, grid = seq(-2, 1, 0.05),
+   bw = model6bwIMSE$BW[, "bw"]/2)
> lpdensity.plot(model7, ylabel = "density")
Figure 5: Density Plots with IMSE-optimal Bandwidth and Under-smoothing. (a) IMSE-optimal Bandwidth; (b) Under-smoothing.

4 Simulation Evidence and Comparison with Other R Packages

Density estimation is available in many statistical packages. In this section, we illustrate the finite-sample performance of our lpdensity package in a simulation study, and compare it with other R packages implementing kernel-based density estimation procedures. Deng and Wickham (2011) also compare density estimation implementations in terms of speed, accuracy and frequency of updates. They review not only kernel-based density estimators but also procedures based on histograms/binning and penalization. However, Deng and Wickham (2011) do not discuss statistical inference or whether the estimators are valid for boundary evaluation points.

The functions/packages we consider are: KernSec() in GenKern (Lucy and Aykroyd, 2013); bkde() and locpoly() in KernSmooth (Wand and Ripley, 2015); kdde() and kde() in ks (Duong, 2007); npudens() in np (Hayfield and Racine, 2008); kdrobust() in nprobust (Calonico, Cattaneo and Farrell, 2019); plugin.density() in plugdensity (Herrmann and Maechler, 2011); sm.density() in sm (Bowman and Azzalini, 2018); as well as stats::density(). Table 1 provides a brief summary of their main features. First, most packages do not offer valid density estimates at (or near) boundaries, with the exception of KernSmooth and our lpdensity package. As explained in Section 2, lpdensity provides valid density estimation regardless of whether the evaluation point is interior or boundary, and such boundary carpentry is fully automatic, meaning that the user does not need to explicitly specify a boundary kernel. Second, statistical inference is only available in four packages: np, nprobust, sm and lpdensity. However, among these four packages, only nprobust and lpdensity account for the possibly leading smoothing bias when constructing test statistics/confidence intervals. Finally, only three packages, KernSmooth, ks and lpdensity, offer the ability to estimate density derivatives. Overall, the lpdensity package provides valid density estimates and derivatives thereof, for both interior and boundary evaluation points. For statistical inference, lpdensity takes into account the possibly leading smoothing bias, and hence delivers (asymptotically) valid tests and confidence intervals. Together with its set of bandwidth selectors, we believe lpdensity is an important addition to the toolkit for distribution estimation and inference.

Now we describe our simulation design. We consider the estimation of $F^{(v)}(x)$ for $v \in \{1, 2\}$ (i.e., the density function and its first derivative) at three evaluation points: one interior point, one at the upper boundary, and one near the upper boundary. We consider several local polynomial orders $p$, and we use the triangular kernel. The data consists of a random sample generated from a standard normal distribution truncated above, so that the support of the data has an upper boundary; that is, the density is $f(x) = \phi(x)/\Phi(\bar{x})$ for $x \le \bar{x}$, where $\Phi$ and $\phi$ are the distribution and density functions of the standard normal distribution and $\bar{x}$ denotes the truncation point. We use both population bandwidths, obtained by minimizing the true asymptotic (integrated) mean squared error, and data-driven bandwidths constructed by each package. We employ 5,000 Monte Carlo repetitions. For the point estimate, we report its bias, standard error and root mean squared error. Whenever available, we also report the empirical coverage probability of a nominal 95% confidence interval, as well as its average length. Simulation results are summarized in Tables 2–7 for the smaller sample size, and Tables 8–13 for the larger sample size. Except for KernSmooth::locpoly(), the other packages do not employ a local polynomial regression approach; their results are reported in the corresponding tables for comparison.

We first focus on Table 2. For the interior evaluation point, the mean squared errors (rows 1–12, columns 4 and 10) are of similar magnitude across packages. For the boundary and near-boundary evaluation points, however, lpdensity has a much smaller mean squared error (rows 23, 24, 35 and 36, columns 4 and 10). This is because the other packages do not correct for the well-known boundary bias, and hence their density estimates will be biased even in large samples. The boundary bias issue also affects statistical inference: our confidence interval consistently has an empirical coverage rate close to the nominal 95% level, while for the other three packages offering an inference procedure, the empirical coverage is almost zero, meaning that the boundary bias is so large that a nominal 95% confidence interval almost never covers the target parameter. In Table 5, we report simulation evidence for estimating the density derivative ($v = 2$) using a local quadratic regression ($p = 2$). Although density derivative estimation is available in two other packages, neither provides statistical inference.

Similar conclusions can be drawn from the other tables, which we do not discuss in detail for brevity. Note, however, that our confidence interval can be very wide for estimating the density derivative using a local quartic regression ($v = 2$, $p = 4$), especially at boundary evaluation points (Table 7). The reason is twofold. First, estimating derivatives nonparametrically is always difficult, unless the sample size is very large. Second, our confidence interval is constructed based on a local polynomial regression of order $q = p + 1$. This further increases the variability of the point estimate, and hence increases the confidence interval length.

5 Conclusion

We gave an introduction to the general-purpose software package lpdensity, which offers local polynomial regression based estimation and inference procedures for the CDF, PDF and higher-order density derivatives. This package is available in both the R and Stata statistical platforms, and further installation details and links to the latest version of the software can be found at https://sites.google.com/site/lpdensity/.

A Appendix: Details on Bandwidth Selection

In this appendix we provide more methodological details on the bandwidth selectors implemented through the function lpbwdensity(). We continue to focus exclusively on the case $v \ge 1$, and therefore we do not discuss the case $v = 0$ corresponding to CDF estimation, which is quite different because of its distinct properties (e.g., root-$n$ consistency and asymptotic linearity). All cases, regularity conditions and other technical details can be found in CJM and their supplemental appendix.

a.1 Rule-of-thumb Bandwidth Selectors

Recall that, in the definitions of $h_{\mathsf{MSE}}(x)$ and $h_{\mathsf{IMSE}}$, we introduced the pre-asymptotic quantities $\hat{\mathcal{V}}(x)$, $\hat{\mathcal{B}}_1(x)$ and $\hat{\mathcal{B}}_2(x)$. For the rule-of-thumb bandwidth selectors, we consider a Gaussian reference model, under which all evaluation points are treated as interior. Asymptotically, those quantities then have well-defined limits, and those limits are products of DGP-related quantities (such as normal densities or their higher-order derivatives) and non-random matrices which only depend on $p$, $v$ and the kernel. The rule-of-thumb bandwidth can then be computed, and we denote it by $\hat h_{\mathsf{ROT}}(x)$. An integrated version can be constructed accordingly, and is denoted by $\hat h_{\mathsf{IROT}}$.

Given $p$, $v$ and the kernel, the rate at which the MSE-optimal bandwidth shrinks to zero depends on whether $p - v$ is even or odd, and on whether $x$ is an interior or a boundary point. We summarize these rates in Table 2 (panel (a)), as well as the rates at which the rule-of-thumb bandwidths shrink (panel (b)). Note that the (I)ROT bandwidths have the correct rate of convergence, except when $p - v$ is even and $x$ is near the boundary, since the Gaussian reference model treats any evaluation point as interior.

a.2 (I)MSE-optimal Bandwidths

We now discuss some implementation details of the MSE-optimal bandwidth, which also apply to the IMSE-optimal bandwidth. First, the unknown higher-order derivatives $F^{(p+1)}(x)$ and $F^{(p+2)}(x)$ are replaced by consistent estimates, $\hat F^{(p+1)}(x)$ and $\hat F^{(p+2)}(x)$, where the IROT bandwidths are used to make the estimated derivatives stable. Here we augment the notation of the bandwidth with one additional argument, since the bandwidth depends on both the polynomial order and the order of the derivative. For example, $\hat h_{\mathsf{IROT}}(p, v)$ is the estimated bandwidth in a Gaussian reference model which is integrated-MSE-optimal for a local polynomial regression of order $p$ when estimating the $v$-th derivative.

The next step is to construct the pre-asymptotic quantities $\hat{\mathcal{V}}(x)$, $\hat{\mathcal{B}}_1(x)$ and $\hat{\mathcal{B}}_2(x)$, and for this purpose we also need a preliminary bandwidth, for which we use the IROT bandwidth. Then the MSE-optimal bandwidth is estimated by

$$\hat h_{\mathsf{MSE}}(x) = \arg\min_{h > 0}\left\{ \widehat{\mathsf{Bias}}\big[\hat F^{(v)}(x)\big]^2 + \widehat{\mathsf{Var}}\big[\hat F^{(v)}(x)\big] \right\},$$

with

$$\widehat{\mathsf{Bias}}\big[\hat F^{(v)}(x)\big] = h^{p+1-v}\,\hat F^{(p+1)}(x)\,\hat{\mathcal{B}}_1(x) + h^{p+2-v}\,\hat F^{(p+2)}(x)\,\hat{\mathcal{B}}_2(x), \qquad \widehat{\mathsf{Var}}\big[\hat F^{(v)}(x)\big] = \frac{1}{n h^{2v-1}}\,\hat{\mathcal{V}}(x).$$

Under very mild regularity conditions, it can be shown that $\hat h_{\mathsf{MSE}}(x)$ is rate consistent (see Table 2, panel (c)). Under the assumption that either (i) $x$ is near the boundary, or (ii) $p - v$ is odd, it is possible to prove a stronger result, $\hat h_{\mathsf{MSE}}(x)/h_{\mathsf{MSE}}(x) \to_{\mathbb{P}} 1$, so that the MSE-optimal bandwidth selector is consistent both in rate and in constants. What happens for an interior $x$ with $p - v$ even? In this case the leading bias constant vanishes in the limit, and the estimated bias captures only part of the leading bias. Hence $\hat h_{\mathsf{MSE}}(x)$ has the correct rate of convergence, but is not consistent for $h_{\mathsf{MSE}}(x)$ in the strong sense.

Table 2: Bandwidth Rates for the Density Estimator ($v \ge 1$). Panel (a) reports the rate at which $h_{\mathsf{MSE}}(x)$ shrinks to zero, panel (b) the rates for $\hat h_{\mathsf{ROT}}(x)$ and $\hat h_{\mathsf{IROT}}$, and panel (c) the rates for $\hat h_{\mathsf{MSE}}(x)$, in each case distinguishing between $p - v$ odd and even, and between interior and boundary evaluation points.

References

  • Bowman, A., and Azzalini, A. (2018), “sm: Smoothing Methods for Nonparametric Regression and Density Estimation,” R package version 2.2-5.6.
  • Calonico, S., Cattaneo, M. D., and Farrell, M. H. (2018), “On the Effect of Bias Estimation on Coverage Accuracy in Nonparametric Inference,” Journal of the American Statistical Association, 113, 767–779.
  • Calonico, S., Cattaneo, M. D., and Farrell, M. H. (2019), “nprobust: Nonparametric Kernel-Based Estimation and Robust Bias-Corrected Inference,” Journal of Statistical Software, forthcoming.
  • Cattaneo, M. D., Jansson, M., and Ma, X. (2019), “Simple Local Polynomial Density Estimators,” Journal of the American Statistical Association, forthcoming.
  • Cheng, M.-Y., Fan, J., and Marron, J. S. (1997), “On Automatic Boundary Corrections,” Annals of Statistics, 25, 1691–1708.
  • Deng, H., and Wickham, H. (2011), “Density Estimation in R,” working paper.
  • Duong, T. (2007), “ks: Kernel Density Estimation and Kernel Discriminant Analysis for Multivariate Data in R,” Journal of Statistical Software, 21, 1–16.
  • Fan, J., and Gijbels, I. (1996), Local Polynomial Modelling and Its Applications, New York: Chapman & Hall/CRC.
  • Hayfield, T., and Racine, J. S. (2008), “Nonparametric Econometrics: The np Package,” Journal of Statistical Software, 27, 1–32.
  • Herrmann, E., and Maechler, M. (2011), “plugdensity: Plug-in Kernel Density Estimation,” R package version 0.8-3.
  • Karunamuni, R., and Albert, T. (2005), “On Boundary Correction in Kernel Density Estimation,” Statistical Methodology, 2, 191–212.
  • Lucy, D., and Aykroyd, R. (2013), “GenKern: Functions for Generating and Manipulating Binned Kernel Density Estimates,” R package version 1.2-60.
  • Wand, M., and Jones, M. (1995), Kernel Smoothing, Florida: Chapman & Hall/CRC.
  • Wand, M., and Ripley, B. (2015), “KernSmooth: Functions for Kernel Smoothing Supporting Wand & Jones (1995),” R package version 2.23-15.
  • Zhang, S., and Karunamuni, R. J. (1998), “On Kernel Density Estimation Near Endpoints,” Journal of Statistical Planning and Inference, 70, 301–316.