Log In Sign Up

Targeted Estimation of L2 Distance Between Densities and its Application to Geo-spatial Data

by   George Shan, et al.

We examine the integrated squared difference, also known as the L2 distance (L2D), between two probability densities. Such a distance metric allows for comparison of differences between pairs of distributions or changes in a distribution over time. We propose a targeted maximum likelihood estimator for this parameter based on samples of independent and identically distributed observations from both underlying distributions. We compare our method to kernel density estimation and demonstrate superior performance for our method with regards to confidence interval coverage rate and mean squared error.


page 1

page 2

page 3

page 4


A recursive procedure for density estimation on the binary hypercube

This paper describes a recursive estimation procedure for multivariate b...

Maximum Likelihood Estimation of Log-Concave Densities on Tree Space

Phylogenetic trees are key data objects in biology, and the method of ph...

Density estimation using Dirichlet kernels

In this paper, we introduce Dirichlet kernels for the estimation of mult...

On the integrated mean squared error of wavelet density estimation for linear processes

Let {X_n: n∈} be a linear process with density function f(x)∈ L^2(). We ...

Smoothness-Penalized Deconvolution (SPeD) of a Density Estimate

This paper addresses the deconvolution problem of estimating a square-in...

Log-transformed kernel density estimation for positive data

Kernel density estimators (KDEs) are ubiquitous tools for nonpara- metri...

Nonparametric Estimation of Band-limited Probability Density Functions

In this paper, a nonparametric maximum likelihood (ML) estimator for ban...

1 Introduction

Quantification of the difference between two probability densities can be useful. Some methods like the Kolmogorov-Smirnov score and supremum norm give a measure of maximum difference. Others like the Kullback-Leibler divergence (Kullback and Leibler, 1951) measure the divergence of one density from another. For a symmetric measure of overall differences, the L2 distance (L2D), otherwise known as the integrated square difference, can be useful. Here we study estimation of the L2 distance (L2D) between two unknown distributions using independent iid observations from both distributions. We also apply this estimation to make useful inferences from comparison of pairwise density differences in real data. Previous work with application to L2D has been done using different approaches, with applications including unsupervised change point detection and class balance estimation Sugiyama et al (2013). Estimation of the integral square functional has been examined previously, Bickel and Ritov (1988), Gine and Nickl (2008).

L2 distance between two functions is the integral of the squared difference between the two functions, in our case probability density functions;

It can provide a useful indication of the size of differences between two distributions. In the first part of this paper we provide a method of targeted estimation of the true L2 distance between two unknown distributions based on samples from those distributions. This estimator is called a targeted maximum likelihood estimator.

Differences between pairs of densities can be compared across many pairs by estimating the L2D for each pair and comparing those estimates. In the second part of this paper we illustrate the use of L2D estimation on a real world data set to identify the most significant changes resulting from an intervention.

The most straightforward way to approach the problem of L2D estimation is to estimate the densities of our two distributions separately and then to evaluate the L2D by plugging in our two estimates. This leads to the issue of how to deal with smoothing. In general we will operate under a non-parametric model for density estimation. Non-parametric density estimation requires us to smooth our empirical distribution to find a suitable estimate. However, the degree of smoothing depends on some measure of fidelity to the data generating distribution, such as the cross validated log-likelihood loss, or the mean integrated squared error. Finding an appropriate balance between bias and variance when smoothing a non-parametric density estimate does not guarantee the same optimality with regards to the L2D. The result is that a ’good’ estimator for the density of a distribution as a whole is not necessarily a ’good’ estimator for the L2 distance.

One clear motivation for a better approach is that L2D depends on the difference between two densities, whereas the typical bias/variance trade off for each individual density estimate is only concerned with that particular distribution. Smoothing is inevitable, but the smoothing should be tailored to our parameter of interest. This is the motivation behind targeted estimation. In order to ’target’ the L2D, we should consider how this parameter is sensitive to various parts of the distribution.

2 Targeted Estimation of L2 Distance

Consider random variable

, of which we have i.i.d. observations from distribution with density . is continuous multivariate with density everywhere defined and is Bernoulli. Denote as with density and likewise as with density , both conditional distributions of for any hypothetical distribution of and . Our operating model is non-parametric for both conditional densities and , respectively, and holds as arbitrarily fixed. Our estimand is:

Our approach is to develop a targeted maximum likelihood estimator (TMLE) based on the framework described by van der Laan and Rubin (2006) and in Targeted Learning (van der Laan and Rose, 2011). We will first define the TMLE and then establish that it is an asymptotically efficient estimator.

A TMLE is a two step substitution estimator. The first step is to estimate the true density . We call this the initial estimation step, which gives us an initial density estimate , with corresponding probability measure . In the second step we update to target it towards our parameter. Consider a submodel that includes and which is indexed by a single parameter with the property that the score at is , where is the canonical gradient of at . We say that this property is the least favorable model (LFM) property and the mentioned submodel possesses this property at so it is a local least favorable model (LLFM). There is a unique such submodel through such that the LFM property applies at every member element, not just at . We call this the universal least favorable model (ULFM). The member of ULFM which maximizes the likelihood of our observed data is the TMLE update, . Finally, the TMLE of is . van der Laan and Gruber (2016) provide a detailed discussion on these points.

In Appendix A we find the canonical gradient of to be:


It follows that the second order remainder defined as =

We can express the error of our estimator as:


An efficient estimator is a linear estimator with influence function equal to the canonical gradient of the estimand. Its error behaves as , where denotes the empirical measure. Such an estimator is the estimator with minimal asymptotic variance among all regularly asymptotically linear estimators. is asymptotically efficient if the last 3 expressions on the right side of Equation 2 converge to zero at a rate faster than , which leaves the first term, the efficient error, that behaves as a sample mean converging at

rate. This estimator has Gaussian distribution with mean equal to the estimand and standard error equal to standard deviation of

divided by .

If is the MLE for a sub-model with score at then the second term in Equation 2 is 0. If falls in a Donsker class and then the third term on the right side of Equation 2 is as a result of asymptotic equicontinuity of an empirical process indexed by a Donsker class (van der Vaart and Wellner, 1996). Finally, the second order remainder is if the density estimator converges in L2 norm faster than rate. These conditions would imply that . This means that under these generally achievable conditions, TMLE is asymptotically efficient. It is asymptotically linear with influence function equal to the canonical gradient of the parameter under the true distribution. It’s distribution is asymptotically normal and unbiased. Asymptotically accurate Wald confidence intervals can be constructed centered on and using standard deviation of divided by as standard error. Further elaboration on conditions and general satisfiability thereof for asymptotic efficiency of TMLE is provided in Appendix B.

For the initial density estimation we will use a Gaussian kernel density estimator with global bandwidth selected by HPI plug in method (Duong, 2007; Wand and Jones, 1994), for the sake of simplicity. In principle any type of regular, consistent estimator can be used, and the most accurate estimator or ensemble for the particular data should be used. It’s worth restating that the term depends on the accuracy of the updated density and thus also the initial density estimator. Asymptotic efficiency depends on converging faster than and finite sample performance depends on this term as well.

The update step is the main feature of the TMLE. We find a density estimate that solves its empirical efficient influence equation. There are many such solutions, but we want to be as close to the true distribution as possible. Our best guess at this is our initial estimate. We conceive of a single parameter sub-model that is part of the non-parametric model space and contains our initial estimate as one of its elements. The score of the sub-model at the initial estimate is equal to the canonical gradient applied to that estimate. We call this a ‘least favorable model’, because any perturbation within this parametric model leads to the largest possible change in our parameter (and least favorable to estimation of the estimand). It is ‘local’ in that the canonical gradient applied to the initial estimate is not the same as that applied to a nearby member of the model space, and thus the described property applies to the locality of the initial estimate, but not necessarily to anywhere else. There is one sub-model such that the score at every member is equal to the canonical gradient applied to that member distribution. This we call the ‘universal least favorable path’. A maximum likelihood estimate along this ‘universal path’ solves its empirical score equation and thus its empirical efficient influence equation. This is the targeted maximum likelihood estimate.

In practice, solving an MLE along the universal path requires repeated small updates starting with the initial estimate. Oftentimes using the gradient at the initial estimate as the score for the entire sub-model will find an MLE in one step that comes close to solving it’s empirical efficient influence equation. This is one particular local least favorable model. A rule of thumb for close: if empirical mean of efficient influence function is smaller in magnitude than its empirical standard deviation divided by , where is the sample size. Such a condition would imply that is small enough to not provide a meaningful contribution to the finite sample behavior of the TMLE, while preserving asymptotic efficiency. If this condition is not met, an additional update step can close the gap. In the work discussed here, one update was found to be sufficient to meet this stopping criterion.

The one step TMLE described above was implemented for our simulations and analysis. First, we apply our canonical gradient operator to the initial distribution estimate and obtain , the score of our local least favorable model.

This local least favorable model is , :

Care should be taken to ensure that perturbation by yields a valid density function. This is done by keeping small enough such that the resulting density estimate is not anywhere negative. We now compute the MLE of such that it maximizes the log-likelihood:

Our TMLE update is:

Note that here we use the empirical proportion for and this remains unchanged after the update step. We check that meets our aforementioned criterion of being less than the empirical standard deviation of canonical gradient divided by . If so, we plug in to and we have as our TMLE of the L2D.

3 Simulations: TMLE using initial kernel density estimator

We study the effect of TMLE update applied to an initial kernel density estimate through simulation. Since the TMLE depends on the initial estimate, and the kernel estimator depends on the regularity of the underlying distribution, we will start off by using three different types of data generating distribution with varying levels of regularity. For each type we will use two overlapping identical distributions with a certain amount of offset corresponding to a true L2 distance. These are Normal distribution with standard deviation 0.5 and 0.5 offset (referred to as ’Gaussian’), Isosceles triangle with base 2 and height 1 offset by 0.5 (’Triangle’), and uniform with range 1 offset by 0.1 (’uniform’). For each type of distribution, we will acquire 5000 simulations of sample size

from both distributions, where is 50, 100, 200 … 51200. For each simulation, we will estimate the L2D. This will be done for both kernel density estimates and TMLE adjusted kernel estimates.

To compare performance, we will assess plug in estimation using both TMLE updated density estimate as well as initial density estimate without update. In both cases we will use Gaussian kernel density estimate with HPI bandwidth. L2D between the kernel density estimates for each distribution is computed using numerical integration. Shown below are comparisons of the estimator performance using both methods, across the range of sample sizes.

For each distribution type we examine confidence interval coverage rate and mean squared error across the range of sample sizes. For coverage rate we compute 95% intervals that use sample L2D estimate +/- . The standard error is calculated using either the standard deviation of the sampling distribution (5000 samples), referred to as ’oracle’ since it assumes knowledge that one would not ordinarily have for a single sample, or by using the variance of the empirical influence curve for each sample separately (referred to as ’sample’). The first is the actual standard error, while the second is an estimate. In addition, we examine the mean squared error and variance for the estimates in our sampling distribution, and multiply these figures by the sample size . The variance of the true efficient influence function is shown as a horizontal black line for reference. This is the efficient rate, and any asymptotically efficient estimator should have mean squared error and variance converge to this rate as increases.

Figure 1: Graphs of the density curves of simulated distributions. Red and blue represent the two densities and .
Figure 2: Comparison of 95% confidence interval coverage rate and squared error for kernel density estimator alone and TMLE. Graphs show performance for different sample sizes, where horizontal axis measures the base 2 log of sample size divided by 50. Horizontal black line shows correct coverage rate on left and variance of canonical gradient (efficient rate) on right.

3.1 Results

We can see that for Gaussian and right isosceles triangle, which are reasonably regular distributions, the TMLE achieves correct confidence interval coverage and is asymptotic efficient. We note that the kernel density estimator by itself does not meet these criteria. In particular, we can see the decline of the coverage rate as sample size increases, and the divergence of MSE *

. For the step-wise uniform distribution, which is significantly less regular, we see that the TMLE does not achieve achieve accurate coverage and is not asymptotic efficient at large sample sizes. However it’s performance is even more significantly superior than that of the kernel estimator alone, compared to the other two types of simulations. This result is to be expected since the asymptotic performance of the TMLE estimator depends on the performance of the initial estimation step. The Gaussian kernel estimator is simply too poor when underlying regularity assumptions are violated. However, by looking at the mean square error, we see that it is actually in this situation that the TMLE leads to the best performance gain over the kernel alone. For smaller sample sizes we see that TMLE has higher variance and error because it fits the data. Kernel estimator alone has smaller variance and error at small sample sizes because of the degree of regularity of the distribution. We can see that the overtake point for error between TMLE and kernel alone is a larger sample size for Gaussian than for triangle, with step-wise uniform being the smallest. That tells us that the more accurate the regularity assumptions, the better the smooth kernel estimator will perform, but asymptotically the TMLE will always outperform it.

4 Application: San Francisco Crime Incidences

Here we demonstrate the use of L2D estimation to compare random processes in two time periods and identify the largest changes between these periods. We use publicly available data on San Francisco crime incidences (see Police Department Incident Reports). Around September 9, 2017, the San Francisco Police Department significantly increased the number of neighborhood patrol officers in several areas across the city (SFPD foot patrols). We are interested if this has had some effect on crime patterns. What types of crime are affected by this change? We will consider the geographical coordinate of crimes to be a bi-variate , and our binary whether an incident occurred before or after September 9. L2D provides a way to quantify crime pattern changes across the city after the patrol officers were added. This may help identify effects resulting from policing targeted to particular locations.

Our data consists of observations of recorded crime incidents, with each observation corresponding to one incident and including the following variables: geographic coordinates of the incident, date of incident, category of crime, outcome for that incident. We compare the geographic distribution of different categories of crime in the 80 days before September 9 and the 80 days after September 9. For each category separately, is the bi-variate geographic coordinates of longitude and latitude, for incidents that occurred in the 80 days prior to September 9 and for incidents that occurred in the 80 days after September 9. Thus and are the densities of the geographic coordinates of a particular category of crime before and after September 9, respectively. We restrict our analysis to the 19 categories of crimes for which there are at least 100 incidents in both the 80 days before and the 80 days after. This is to give us a reasonable number of observations for density estimation, without using too long a stretch of time where other causes may shift the patterns of crime. For each category we estimate the L2D between the bi-variate geographic density functions of the incident generating process before versus after September 9. For each category of crime and 80 day window, we treat the recorded incident locations as iid bi-variate observations of longitude and latitude. The result is 19 L2D estimates for 19 pairs of distributions, one for each category of crime. By comparing these 19 estimates, we can identify which categories experienced the largest shift in their geographic distribution before versus after the intervention.

Figure 3: Graph: L2D for each crime category between the density in the 80 days before the intervention and the density in the 80 days after the intervention. Two estimates are made for each category, one using kernel density estimator and the other using TMLE with kernel initial estimator. Two 95% wald confidence intervals are provided: blue - kernel ; red - TMLE. Both use the same standard error estimate: empirical standard deviation of canonical gradient applied to kernel density estimate. Kernel density estimator use bi-variate Gaussian kernel with global bandwidth from HPI selector of Wand and Jones (1994). Table: Sample sizes (number of incidents) for each category, in the 80 days before (left column), and the 80 days after (right column) September 9.

The largest L2 distance is with drug/narcotics incidents, and it is significantly larger than the others. Further analysis shows that these types of incidents are concentrated in the downtown Civic Center area and along Mission street, a commercial avenue. These would be the places most likely to be targeted by increased police foot and bicycle patrols. Furthermore, the data shows that the vast majority of drug/narcotics incidents result in arrests, as opposed to other types of incidents which mostly have low resolution rates. This suggests that most recorded drug/narcotics incidents correspond to interactions between police and suspects, whereas other types of recorded incidents are mostly cases of reports by third parties. It’s plausible that increased foot patrols will lead to greater numbers of interactions between officers and people using narcotics in public, and this could account for the distinctive change in density for this type of incident in particular. One might expect such changes to be reflected in the number of recorded incidents across the city, with a larger percentage increase in recorded drug/narcotics incidents compared to the other categories. This need not be the case however, as the progression from summer (prior to Sep 9) to autumn (after Sep 9) could have different effects on rates of different type of crimes. Indeed we don’t see a particularly large increase in the number of drug/narcotics incidents in the 80 days before to the 80 days after September 9. This illustrates an advantage of examining the probability densities and the L2 distance, which allows us to control for changes in rates affecting the entire distribution.

Examining L2 distance can provide insights into how different random processes are affected by some change or across time. Our analysis of the SFPD data illustrates how this can be done. Changes in probability density for particular crime categories can help identify candidate categories that experienced the largest changes, and help direct further investigation. From the L2 distance, it’s clear that drug/narcotics incidents stand out. Making this observation from a scatter-plot or a heat-map can be difficult. Furthermore, how the changes are distributed within categories/processes which experienced the largest overall changes can be analyzed using the difference in densities. This can be done conventionally or by the methods described by Sugiyama et al. (2013). The utility of TMLE can be seen in our analysis here as well. The coordinates of drug/narcotics incidents align closely with streets, which are lines on the map. For multivariate and/or irregularly distributed data, over-smoothing is particularly problematic for density estimation, and hence L2 distance estimation. TMLE performs targeted under-smoothing, which ameliorates this issue. We can see in this study that there can be a significant difference between kernel estimates and TMLE, so any improvement is not trivial.

5 Discussion

In estimation we are faced with the bias variance trade off. However, the right trade off depends on what is being measured. The benefit of TMLE is that we can adjust this trade off to a particular parameter, in this case the L2D. Estimation of using a kernel smoother without TMLE update often performs poorly, especially when the underlying distribution is irregular. In particular, over-smoothing of our density estimate will bias our estimate of L2D.

When is non-zero, this term contributes to the error of our estimator. Updating to removes (or reduces) this error term, but at the same time we wish to avoid increasing , which requires controlling the amount of change that we make to an accurate initial estimator. The solution is to refit, but in a manner that proportionately targets places where there is a large difference in density. The re-fitting in our update step depends across the distribution on the same , but differs according the gradient at each location, which reflects the difference in density. This is a form of targeted under-smoothing.

It should be noted that there is more than one solution to the problem of finding an estimate that solves its empirical influence equation. Any consistent estimator that converges faster than rate will allow the TMLE to be asymptotically efficient. But at finite samples, the closer the initial estimator is to the truth, the better our performance will be. Conversely, initial estimators which fail to converge at rate will not allow our TMLE to achieve asymptotic efficiency.

Finally, we will suggest some future topics of research. When using L2D to compare changes in density, it can be the case that some distributions are more concentrated than others, leading to larger squared differences when the relative changes are the same. Integrated square difference can be normalized by some appropriate function of the two densities to give a normalized distance metric that could better lend itself to comparison of random processes that have markedly different density distributions. Alternatively, instead of comparing different random processes, we could examine one random process over time. Sugiyama et al. (2013) examined the use of L2D estimation for change point detection in time series. The TMLE methodology could be used to build upon these approaches.

6 References

Bickel, P., & Ritov, Y. (1988). Estimating integrated squared density derivatives: Sharp best order of convergence estimates. Sankhya: The Indian Journal of Statistics, Series A 50 381–393.

Duong, T. (2007). ks: Kernel Density Estimation and Kernel Discriminant Analysis for Multivariate Data in R. Journal of Statistical Software, 21, 7.

Gine, E., & Nickl, R. (2008). A simple adaptive estimator of the integrated square of a density. Bernoulli, 14, 47-61.

Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22, 79–86.

Sugiyama, M., Kanamori, T., Suzuki, T., Christoffel du Plessis, M., Liu, S., & Takeuchi, I. (2013). Density-Difference
Estimation. Neural Computation, 25, 2734-2775.

van der Laan, M.J., & Rubin, D. (2006). Targeted Maximum Likelihood Learning. U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 213.

van der Laan, M.J., & Rose, S. (2011). Targeted Learning. Springer-Verlag New York

van der Laan, M.J., & Gruber, S. (2016). One-Step Targeted Minimum Loss-based Estimation Based on Universal Least Favorable One-Dimensional Submodels. U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 347.

van der Vaart, A.W., & Wellner, J.A. (1996). Weak Convergence and Empirical Processes. Springer-Verlag New York

Wand, M.P., & Jones, M.C. (1994). Multivariate Plug-in Bandwidth Selection. Computational Statistics, 9, 97-116.

Police Department Incident Reports: Historical 2003 to May 2018. Sep 12, 2018. Accessed April 1, 2019.

SFPD foot patrols will quadruple in SF Mission. August 31, 2017. Accessed April 1, 2019.

7 Appendix A: Derivation of canonical gradient

Consider jointly distributed

, multivariate continuous and Bernoulli, and parameter . For a model with and both non-parametric and fixed, the canonical gradient of at denoted is:

Define such that: . It follows from the above that:

The canonical gradient is the unique gradient of that is an element of the tangent space spanned by all scores of our model.

Consider such that for some score in our model as a path in our model through . The functional (or pathwise) derivative at can be expressed as a covariance under of a gradient with score . The unique which can be expressed as a score within our model is the canonical gradient

First we express and in terms of the conditional distributions and scores of . Observe that:

Now define and . and are scores for the conditional distributions with , . is known and static in our model. Thus, . By evaluating the derivative with respect to of at we can express the score of as:

Now we find at .

The above expression of the pathwise derivative can be decomposed as a covariance between a gradient and the score in our model as follows:

We can mean-centered both parts of the first expression in our above integrand and the value of the covariance will stay the same. We have now expressed the pathwise derivative as a covariance under of the score and a gradient. Note that this new gradient has the form of a score for the model, so it is the canonical gradient :

Given the canonical gradient, it is of interest to understand the error in using it as a first order approximation for the difference between the parameters of two members of our model, and . This error is a second order term which we will refer to as . It follows from the statement of the corollary that:

First we evaluate

It follows that:

8 Appendix B: Asymptotic Efficiency of TMLE

Consider a TMLE such that . Assume is an element of a class of multivariate real valued Cadlag functions on a cube with sectional variation norm bounded by a universal M. In addition, assume and . Then , i.e. is an asymptotically efficient estimator of .

Note that we can express:

is the empirical measure and is an empirical process. It follows that is as a result of the asymptotic equicontinuity of an empirical process indexed by a Donsker class (van der Vaart and Wellner), where is an element of the Donsker class of real valued Cadlag functions with bounded sectional variation norm.

Consistent density estimators will satisfy convergent in L2(P) and any defined density is Cadlag, with Cadlag as well. Bounded sectional variation norm requires that the density estimator is reasonably smooth. This depends on the kernel shape and bandwidth. We consider the HPI global bandwidth and other variable global bandwidth of order . Simulations were conducted for a Gaussian kernel (not shown) and illustrated that sectional variation norm is bounded for HPI and or greater. For or less divergence is observed. This illustrates that for most practical density estimators the aforementioned conditions will be met.

For it is sufficient that converges in L2(P) at a rate faster than . This depends on the choice of initial density estimator and the underlying distribution, but can be generally met by consistent density estimators.

Finally, our TMLE is a such that for the one step estimator and 0 when using the universal least favorable model.

We are left with the leading term which is the empirical mean of iid random variables and is thus consistent and converges to a normal mean-zero limit distribution. We have thus shown that is asymptotically equivalent to an empirical mean of the canonical gradient of our parameter, and thus it is asymptotically linear with influence curve the canonical gradient. is thus an asymptotically efficient estimator of .