Robust Comparison of Kernel Densities on Spherical Domains

While spherical data arises in many contexts, including in directional statistics, the current tools for density estimation and population comparison on spheres are quite limited. Popular approaches for comparing populations (on Euclidean domains) mostly involve a two-step procedure: (1) estimate probability density functions (pdfs) from their respective samples, most commonly using the kernel density estimator, and (2) compare pdfs using a metric such as the L2 norm. However, both the estimated pdfs and their differences depend heavily on the chosen kernels, bandwidths, and sample sizes. Here we develop a framework for comparing spherical populations that is robust to these choices. Essentially, we characterize pdfs on spherical domains by quantifying their smoothness. Our framework uses a spectral representation, with densities represented by their coefficients with respect to the eigenfunctions of the Laplacian operator on a sphere. The change in smoothness, akin to using different kernel bandwidths, is controlled by exponential decays in coefficient values. Then we derive a proper distance for comparing pdf coefficients while equalizing smoothness levels, negating influences of sample size and bandwidth. This enables fair and meaningful comparisons of populations, despite vastly different sample sizes, and leads to robust and improved performance. We demonstrate this framework using examples of variables on S1 and S2, and evaluate its performance using a number of simulations and real data experiments.





1 Introduction

The estimation of probability density functions (pdfs) and comparisons of underlying populations are fundamental problems in statistics. In a variety of situations, where data satisfy some natural constraints, it is better to view and analyze the data as elements of a non-Euclidean manifold. A simple example is directional statistics, where one deals with the analysis of data on a unit sphere. In order to understand the limitations of current solutions for estimating and comparing densities on spherical domains, we start with a discussion of methods in Euclidean domains. The classical nonparametric estimate of a pdf, given samples from that density, is the kernel density estimate (Rosenblatt, 1956; Parzen, 1962). This approach is commonly used for Euclidean domains but can easily be adapted to spheres as well. There are two key choices to be made in this estimation: (1) the kernel function, a symmetric unimodal function that integrates to one, and (2) the bandwidth. It is widely acknowledged that the choice of bandwidth is more influential than the choice of kernel in terms of pdf estimation performance. Henceforth, in this paper, we fix the kernel to be an isotropic (i.e., circularly symmetric) Gaussian-type kernel and focus on the issues arising from using different bandwidths. The choice of Gaussian kernel facilitates a group structure that will be exploited later in this paper. To highlight the importance of bandwidth in density estimation, Figure 1 shows an example of pdf estimation. Panel (a) shows several estimates of the pdf for different bandwidths on the same data. Another factor that drastically affects the final estimate is the sample size, as highlighted in panel (b).

Figure 1: Examples of kernel density estimation. (a) and (b) Kernel estimates under different bandwidths and sample sizes. (c) and (d) Estimated densities in (a) and (b) at the same smoothness level as the true density. Here, “bd” indicates the bandwidth and “N” indicates the sample size.

Our interest in this paper is more in comparing populations than in merely estimating pdfs. If we use kernel density estimates and compare them using one of the standard metrics, the results will naturally be very sensitive to the choice of bandwidths and sample sizes. In order to make this comparison robust to low sample sizes and different bandwidth choices, there are several possibilities:

  1. Use a fixed bandwidth. We can fix a bandwidth for all pdf estimates, and then use any function norm (Cha, 2007) for comparison. While this is a convenient strategy, it suffers from the problem that the final answer will strongly depend on the sample size (Marron and Schmitz, 1992) (also illustrated in Figure 1). Different sample sizes can lead to very different pdf estimates when using the same bandwidth despite coming from the same underlying distribution.

  2. Use an adaptive bandwidth. We can use one of the bandwidth selection methods (Jones et al., 1996b, a; Bowman, 1984; Scott and Terrell, 1987; Turlach, 1993; Botev et al., 2010) to estimate pdfs and then compare them. However, there is no consensus on which approach works best in general scenarios. Most bandwidth selection methods are based on minimizing the integrated squared error or the mean integrated squared error (MISE), but they often fail in practice because the true pdf, which is necessary for calculating these quantities, is unknown.

  3. Use a fixed smoothness level. Another solution, coming from a very different perspective, is to focus on the smoothness of the estimated pdfs rather than on the bandwidth, which is the main idea of this paper. We want to quantify the level of smoothness of a pdf as a function and use that in the following way. For any two estimated pdfs being compared, one can bring them to the same level of smoothness, irrespective of their initial bandwidths and sample sizes. Since even classical estimation theory makes assumptions about the smoothness of the underlying density (Marron and Nolan, 1988; Chaudhuri and Marron, 2000), it is a natural criterion to include in estimation. Furthermore, this property can be easily manipulated, as described later, and provides robustness against the choice of bandwidths and sample sizes. The bottom row of Figure 1 shows pdf estimates from the top row after they have been processed to equalize their smoothness level (details of this process are presented later). Now these estimates appear far more similar to each other than before, as they should be.

While comparisons of populations are needed in many settings, we consider two important applications. The first application is in computer vision and image analysis, where a variety of image analysis techniques rely on specifying certain features of interest, e.g., Haar (Viola and Jones, 2001), HOG (Dalal and Triggs, 2005), SIFT (Lowe, 2004) and so on, and comparing differences in the estimated densities of these features. The second application is two-sample hypothesis testing. Any measure of difference between estimated densities is a natural statistic for a two-sample test, e.g., in the spirit of the Kolmogorov-Smirnov (KS) test (Smirnov, 1948). Such methods depend on kernel density estimates, and the bandwidth parameter strongly influences the final results. It would be very useful to develop a metric that depends on something more intrinsically related to the underlying densities than the bandwidth parameter, and that is robust to variability in sample size.

The kernel-based density estimation is essentially a problem of smoothing data. Given a random sample x_1, …, x_n, the empirical density function is given by f_0 = (1/n) ∑_{i=1}^n δ_{x_i}, where δ_{x_i} denotes a point mass at x_i. The convolution of f_0 with a kernel function K gives us an estimated pdf f̂ = f_0 ∗ K, where ∗ denotes the convolution operation. If K is the Gaussian kernel with a bandwidth h, the convolution process is called Gaussian smoothing or blurring. This smoothing is similar to the Gaussian blur of images (Zhang et al., 2013). As described there, one can study Gaussian blur as a solution of the heat diffusion equation with an appropriate initial condition. It turns out that the set of all isotropic Gaussian kernels, under all possible bandwidths, forms a group. The orbit of density functions under this group action defines an equivalence class; in the current context, it can be viewed as the set of all pdfs estimated from the same data but with different bandwidths. This construction carries over naturally to spherical domains and is therefore well suited to density estimation and population comparison on a sphere.
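The semigroup structure behind this group action can be checked numerically. The sketch below (our illustration, not the paper's code) verifies on a grid that blurring a density with a Gaussian of bandwidth s1 and then s2 is the same as a single blur with bandwidth sqrt(s1^2 + s2^2):

```python
import numpy as np

# Gaussian blurs compose: blur(s1) then blur(s2) == blur(sqrt(s1^2 + s2^2)).
# This is the group structure on isotropic Gaussian kernels used in the text.

def gaussian(x, s):
    return np.exp(-x**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

x = np.linspace(-10, 10, 2001)   # symmetric grid so convolution is centered
dx = x[1] - x[0]

f = gaussian(x, 1.0)                                           # initial density
blur1 = np.convolve(f, gaussian(x, 0.6), mode="same") * dx     # blur by 0.6
blur12 = np.convolve(blur1, gaussian(x, 0.8), mode="same") * dx  # then by 0.8
direct = np.convolve(f, gaussian(x, 1.0), mode="same") * dx    # sqrt(0.36+0.64)

err = np.max(np.abs(blur12 - direct))
print(err)  # only small discretization error remains
```

The same identity is what lets the paper treat all bandwidths as points along a single smoothing flow.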

The novel contributions of this paper are as follows. (1) Given kernel density estimates on a spherical domain, computed using a Gaussian kernel with arbitrary bandwidths, our framework identifies the equivalence classes to which they belong. It then compares these estimates by comparing their equivalence classes, and thus is robust to the original bandwidth parameter. (2) We define a function that quantifies the smoothness of pdfs, and use it to specify a section of the action of the blurring group. Two functions are in the same section if they have the same level of smoothness. (3) This framework is applied to develop a two-sample hypothesis test where pdfs estimated from data with arbitrary sample sizes are brought to the same smoothness level, i.e., the same section, and then compared via a manifold distance.

The rest of the paper is organized as follows. In Section 2, we lay out the mathematical foundation of our approach. In Section 3, we apply this framework to kernel density estimation and population comparison on different spherical domains. In Section 4, we develop a two-sample hypothesis test; in Section 5, we discuss the selection of the tuning parameter; and in Section 6, we provide a variety of experimental results using simulated and real data.

2 Mathematical Framework

We start by outlining the mathematical details of our framework, including the Gaussian heat kernel, kernel density estimation, bandwidth selection, and a metric for comparing estimated densities.

2.1 Heat Equation for Density Estimation

Let C(M) denote the set of smooth, non-negative functions on a domain M, and let P ⊂ C(M) be the subset of pdfs. Let Δ be the standard Laplacian operator on M. In this paper, we consider compact domains such as S^1 (a circle), T^2 (a torus) and S^2 (a sphere). For each of these compact domains, it is easy to find an orthonormal Hilbert basis of L^2(M) with the property that every basis element is an eigenfunction of the Laplace operator. Extensions to Euclidean domains such as ℝ and ℝ^2 will also be discussed later, although the Laplacian operator behaves differently on such non-compact domains. For S^1, the eigenfunctions are the Fourier (sine and cosine) basis functions; for T^2, they are products of such functions; and for S^2, they are the spherical harmonics in the spherical coordinates (θ, φ), where θ is the polar angle and φ is the azimuthal angle.

The kernel density estimator based on sample data x_1, …, x_n is f̂(x) = (1/n) ∑_{i=1}^n K_τ(x; x_i), where K_τ is the Gaussian kernel on M and τ is the bandwidth. If we treat the bandwidth as time, the smoothness of the estimated density increases as the time increases. Another way to state this is via the classical heat diffusion equation:

∂f(x, t)/∂t = Δ f(x, t),   (1)

where Δ is the Laplacian operator. The Gaussian kernel used in this paper is required to satisfy this heat equation (Hartman and Watson, 1974). In Eqn. (1), the time t has the same effect as the bandwidth in the kernel estimate, and therefore the time parameter in heat diffusion plays the role of the bandwidth in kernel density estimation. If we set the initial heat to be a given function f(·, 0) (say a kernel density estimate using a bandwidth τ_0), the solution f(·, t) for t > 0 is also a kernel density estimate with a larger bandwidth. For more mathematical details on the heat equation, readers are referred to Lindeberg (1990) and Chaudhuri and Marron (2000).

We represent a smooth pdf on the domain M via its coefficients under a complete orthonormal basis set. Assuming that the domain is compact, e.g., S^1, and using the standard L^2 metric, we define a complete orthonormal Hilbert basis {φ_n, n = 0, 1, 2, …}, where each φ_n is an eigenfunction of Δ with eigenvalue −λ_n, i.e., Δφ_n = −λ_n φ_n. Assuming that φ_0 is a constant function, we have λ_0 = 0, and all other λ_n's are positive due to the positive semi-definiteness of −Δ. Any element f can then be expressed as f = ∑_n c_n φ_n. In practice, we use a basis set of size N + 1 to make this representation finite, so f is (approximately) represented by the coefficient vector c = (c_0, c_1, …, c_N) ∈ ℝ^{N+1}. Note that for a rough density function, one may need a large N to represent the function accurately. We define the mapping F: f ↦ c. As specified thus far, F is a many-to-one map, meaning that its inverse is set-valued. However, we will use F^{-1}(c) to denote the specific density function given by ∑_{n=0}^N c_n φ_n (because of the unit-integral constraint of a density function, we slightly adjust c_0 so that the function integrates to one).

The advantage of the chosen basis is that, after expressing functions through their coefficients, the solution of the heat equation has a simple analytical form. If f(x, t) is the solution of the heat equation with the initial heat distribution f(x, 0) = ∑_n c_n φ_n(x), then this solution takes the form f(x, t) = ∑_n c_n e^{−λ_n t} φ_n(x). Using simple calculus, one can verify that the right-hand side of the heat equation is Δf(x, t) = ∑_n c_n e^{−λ_n t} (−λ_n) φ_n(x), and the left-hand side is ∂f(x, t)/∂t = ∑_n c_n (−λ_n) e^{−λ_n t} φ_n(x). The two sides are exactly equal, so f(x, t) is indeed the solution of the heat equation. In other words, f(·, t) is another kernel estimated density with a bandwidth larger than that of f(·, 0), and one can use the vector (c_n e^{−λ_n t}) to represent it.

2.2 Quantify Smoothness Levels using Sections

Any smooth pdf can now be (approximately) represented by an element of ℝ^{N+1}. We observe that the set of smoothing parameters t in Eqn. (1) has a natural group structure under the addition operation (see (Boothby, 2003), Chapter 3), and its action on ℝ^{N+1} is given by the mapping ℝ × ℝ^{N+1} → ℝ^{N+1}:

(t, c) ↦ g_t · c = (c_0 e^{−λ_0 t}, c_1 e^{−λ_1 t}, …, c_N e^{−λ_N t}).   (2)

For a pdf f with its finite representation c, the orbit under the group action is:

[c] = { g_t · c : t ∈ ℝ }.   (3)

In the kernel density estimation scenario, the group action in Eqn. (2) can be understood as follows. We first use a bandwidth τ_0 to estimate the density (using a Gaussian kernel) and set the estimate as the initial heat, denoted f(·, 0) (represented by a vector c). For a positive time t, g_t · c represents a kernel estimate with a bandwidth larger than τ_0; for a negative time t, it represents a kernel estimate with a smaller bandwidth. The orbit of c (defined in Eqn. (3)) is the set of all possible smoothed versions of the estimate. It can be deemed an equivalence class for the purpose of comparing densities.
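The group action of Eqn. (2) is a one-line operation in coefficient space. A minimal sketch (the function names and the example eigenvalues are ours) shows the group property, acting by t1 then t2 equals acting by t1 + t2, and the invariance of the constant term:

```python
import numpy as np

def smooth_action(t, c, lam):
    """Apply the smoothing action g_t . c = (c_n * exp(-lam_n * t))."""
    return c * np.exp(-lam * t)

# Illustrative eigenvalues on S^1 (0, then 1, 1, 4, 4 for frequencies 1 and 2)
lam = np.array([0.0, 1.0, 1.0, 4.0, 4.0])
c = np.array([1.0, 0.5, -0.3, 0.2, 0.1])   # coefficients of some density

# Group property: smoothing by 0.3 and then 0.7 equals smoothing by 1.0.
a = smooth_action(0.7, smooth_action(0.3, c, lam), lam)
b = smooth_action(1.0, c, lam)
print(np.allclose(a, b))   # True

# lam_0 = 0, so c_0 is invariant: total mass is preserved along the orbit.
print(smooth_action(5.0, c, lam)[0])  # 1.0
```

Sweeping t over the real line traces out exactly the orbit [c] of Eqn. (3).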

Orthogonal Section Under Smoothing Action: Under this geometry, the vector space ℝ^{N+1} becomes a disjoint union of orbits (equivalence classes). Moving along each orbit in the direction of increasing t, the kernel estimated densities become smoother, and vice versa. To compare densities, we compare their orbits, i.e., we define a distance between these equivalence classes. However, since we do not have a metric under which the group acts by isometries, i.e., the orbits are not parallel, we use the concept of an orthogonal section for comparisons. An orthogonal section of ℝ^{N+1} under the group action is defined to be a set S such that: (1) one and only one element of every orbit is present in S; and (2) the set S is perpendicular to every orbit at the point of intersection.

We construct an orthogonal section as follows. First, we define a functional G by G(f) = −∫_M f Δf. Using integration by parts, G can be rewritten as G(f) = ∫_M ‖∇f‖². Since G relates to the norm of the gradient, it measures the first-order roughness of the function f. Also, since f is represented by its coefficients as an element of ℝ^{N+1}, it is convenient to rewrite G as the mapping ℝ^{N+1} → ℝ given by G(c) = ∑_{n=1}^N λ_n c_n². The summation starts from n = 1 because λ_0 = 0 (φ_0 is a constant, see Section 2.1). For a positive real constant ρ, we define a section under the blurring group as

S_ρ = { c ∈ ℝ^{N+1} : ∑_{n=1}^N λ_n c_n² = ρ }.   (4)

Each point in S_ρ represents a pdf with smoothness level equal to ρ (as measured by the function G). By definition, S_ρ is perpendicular to every orbit and, therefore, one can think of S_ρ as a level set containing pdfs at the same level of smoothness. A formal proof is presented in the Appendix. Since the λ_n's are all positive for n ≥ 1, S_ρ is an ellipsoid in the coordinates (c_1, …, c_N). A cartoon illustration of the orbit and level set is shown in Figure 2 panel (a).
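The functional G is a weighted sum of squared coefficients, so it is immediate to compute; a small sketch (our names, illustrative eigenvalues) confirms that applying the smoothing action strictly decreases it:

```python
import numpy as np

def G(c, lam):
    """Roughness functional G(c) = sum_{n>=1} lam_n * c_n^2 (lam_0 = 0)."""
    return float(np.sum(lam[1:] * c[1:] ** 2))

lam = np.array([0.0, 1.0, 1.0, 4.0, 4.0])   # example S^1 eigenvalues
c = np.array([1.0, 0.5, -0.3, 0.2, 0.1])

g0 = G(c, lam)                              # roughness of the raw estimate
g1 = G(c * np.exp(-lam * 0.5), lam)         # roughness after smoothing t=0.5
print(g0, g1, g0 > g1)                      # smoothing lowers G
```

Because every term decays like e^{-2 lam_n t}, G is strictly monotone along each orbit, which is what makes the level sets S_rho cut each orbit exactly once.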

Figure 2: (a) Cartoon illustration of the geometry of the representation space. The radial curves represent orbits, and the ellipsoidal curves represent the level sets defined in Eqn. (4). (b) Illustration of calculating the distance using Algorithm 1. Two initially estimated densities are brought to the same smoothness level ρ within the set S_ρ, where their geodesic distance is computed; one can also select another smoothness level and calculate their distance there.

To help understand these abstract concepts, we use a concrete example. From a random sample drawn from a density function, we construct two estimates using the Gaussian kernel and two different bandwidths, denoted f̂_1 and f̂_2. We use their finite representations a and b for analysis, and let ρ_1 = G(a) and ρ_2 = G(b). The two estimates lie in the same orbit but have different smoothness levels (ρ_1 ≠ ρ_2 if the bandwidths differ). Now we want to bring a and b to the same smoothness level. Without loss of generality, assume ρ_1 > ρ_2, i.e., f̂_1 is rougher. In this case we smooth f̂_1 (i.e., increase t) until its G-value decreases to ρ_2. The precise amount of smoothing required can be found by solving for t such that:

∑_{n=1}^N λ_n (a_n e^{−λ_n t})² = ρ_2.

This is the same as finding the intersection of the orbit [a] with the level set S_{ρ_2}. Due to the monotonicity of the left-hand side with respect to t, we can use the bisection method to solve for t. If we have more than two pdf estimates at different smoothness levels, we can always choose a certain smoothness level, say ρ, and bring them all to this level.
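The bisection step above is straightforward to implement. A sketch under assumed names (G and the smoothing action as defined earlier in this section):

```python
import numpy as np

def G(c, lam):
    return float(np.sum(lam[1:] * c[1:] ** 2))

def find_t(c, lam, rho, lo=-20.0, hi=20.0, tol=1e-12):
    """Bisect for t with G(g_t . c) = rho; G decreases monotonically in t."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if G(c * np.exp(-lam * mid), lam) > rho:
            lo = mid           # still too rough: smooth further
        else:
            hi = mid           # over-smoothed: back off
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

lam = np.array([0.0, 1.0, 1.0, 4.0, 4.0])
a = np.array([1.0, 0.5, -0.3, 0.2, 0.1])
rho = 0.1                      # target smoothness level
t = find_t(a, lam, rho)
print(abs(G(a * np.exp(-lam * t), lam) - rho) < 1e-8)  # True
```

The same routine works for negative t (sharpening), though, as discussed in Section 5, moving toward rougher levels amplifies noise.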

2.3 Measure Difference using Geodesic Distance

The next problem is how to quantify the difference between the two estimates after we bring them to the same section. One idea is to use the Euclidean distance between their coefficient vectors. However, note that S_ρ has an ellipsoidal structure in terms of the coefficient vector c. A more natural way is to treat S_ρ as a manifold and quantify differences between points using geodesic distances. Although it is possible to derive analytical expressions for geodesics on ellipsoids in low dimensions, these formulas get very complicated as the dimension grows. In this paper, we use a numerical method called the path-straightening algorithm (Klassen and Srivastava, 2006) to calculate the geodesic distance on the ellipsoid S_ρ, and denote it d_ρ. Details of this algorithm are presented in the Appendix. Given a numerical tool to compute these geodesic distances, we can now outline the full procedure for comparing any two samples on the domain M.

Algorithm 1 (Numerical Calculation of d_ρ): Given any two arbitrary kernel estimates f̂_1, f̂_2 and a smoothness level ρ, the distance d_ρ is calculated in the following way:

  1. Represent f̂_1 and f̂_2 using coefficients under the chosen orthonormal basis: a = F(f̂_1) and b = F(f̂_2). The orthonormal basis used is discussed in Section 3.

  2. Find t_1 and t_2 that bring a and b to the set S_ρ (the orthogonal section) by solving the equations ∑_{n=1}^N λ_n (a_n e^{−λ_n t_1})² = ρ and ∑_{n=1}^N λ_n (b_n e^{−λ_n t_2})² = ρ.

    Then, let ã = g_{t_1} · a and b̃ = g_{t_2} · b.

  3. Calculate d_ρ on the ellipsoid S_ρ using the path-straightening algorithm between the two points ã and b̃.

Figure 2 panel (b) illustrates Algorithm 1 in cartoon form. It shows the two orbits associated with two densities, along with the actual densities at different levels of smoothing along each orbit.
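The pipeline can be sketched end to end in a few lines. Note one loud simplification: instead of the path-straightening geodesic on the ellipsoid, the sketch rescales coordinates by sqrt(lam_n) (mapping the ellipsoid to a sphere) and uses the spherical arc length, which only approximates d_ρ; all names here are ours:

```python
import numpy as np

def G(c, lam):
    return float(np.sum(lam[1:] * c[1:] ** 2))

def to_section(c, lam, rho):
    """Bisection for the orbit point on S_rho (steps 1-2 of Algorithm 1)."""
    lo, hi = -20.0, 20.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if G(c * np.exp(-lam * mid), lam) > rho else (lo, mid)
    return c * np.exp(-lam * 0.5 * (lo + hi))

def d_sketch(a, b, lam, rho):
    """Crude stand-in for step 3: arc length after rescaling to a sphere."""
    a, b = to_section(a, lam, rho), to_section(b, lam, rho)
    u, v = np.sqrt(lam[1:]) * a[1:], np.sqrt(lam[1:]) * b[1:]
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.sqrt(rho) * np.arccos(np.clip(cos, -1.0, 1.0))

lam = np.array([0.0, 1.0, 1.0, 4.0, 4.0])
a = np.array([1.0, 0.5, -0.3, 0.2, 0.1])
b = np.array([1.0, 0.4, -0.1, 0.3, 0.2])
print(d_sketch(a, a, lam, rho=0.1) < 1e-6)   # True: zero for identical inputs
print(d_sketch(a, b, lam, rho=0.1) > 0)      # True: positive otherwise
```

A faithful implementation would replace d_sketch's arc-length shortcut with the path-straightening iteration referenced in the text.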

3 Kernel Density Estimation and Comparison

In this section, we present the complete framework for kernel density estimation, representation, and comparison on a unit sphere, and discuss its extension to Euclidean domains.

3.1 Densities on Spherical Domains

To apply our framework to densities on a sphere, we need a Gaussian distribution that can serve as the kernel function for estimating densities. In this paper, we focus on S^1 and S^2, but the construction can be generalized to any S^d in principle. Our method requires that the Gaussian kernel used in estimation be a heat kernel, i.e., the kernel itself must be a solution of the heat equation. Hartman and Watson (1974) pointed out that the widely used Fisher distribution, which often plays the role of the normal distribution on the sphere, is not a heat kernel. The heat kernel Gaussian distribution on the circle is given as:

K_τ(θ; μ) = (1/(2π)) [1 + 2 ∑_{n=1}^∞ e^{−n²τ} cos(n(θ − μ))],   (5)

where θ is a point on S^1, μ is the center of the distribution, and τ > 0 controls the variation. We can easily verify that this distribution is a solution of the heat equation. When d ≥ 2, for the d-sphere S^d, the Gaussian kernel is defined as:

K_τ(x; μ) = ∑_{ℓ=0}^∞ e^{−λ_ℓ τ} (N(d, ℓ)/A(S^d)) P_ℓ(⟨x, μ⟩),   (6)

where:
  • A(S^d) is the area of the sphere S^d, which equals 2π^{(d+1)/2}/Γ((d+1)/2),

  • λ_ℓ = ℓ(ℓ + d − 1), for ℓ = 0, 1, 2, …, are the eigenvalues of the Laplacian on S^d,

  • P_ℓ is the Legendre polynomial of order ℓ in dimension d + 1,

  • N(d, ℓ) is the number of linearly independent homogeneous spherical harmonics of degree ℓ in dimension d + 1, and

  • ⟨·, ·⟩ indicates the inner product in ℝ^{d+1}.

Taking d = 2 as one example, we have A(S²) = 4π, the Legendre polynomial can be expressed using Rodrigues' formula P_ℓ(z) = (1/(2^ℓ ℓ!)) (d^ℓ/dz^ℓ)(z² − 1)^ℓ, and N(2, ℓ) = 2ℓ + 1. So the heat kernel normal distribution on a unit 2-sphere is:

K_τ(x; μ) = ∑_{ℓ=0}^∞ e^{−ℓ(ℓ+1)τ} ((2ℓ + 1)/(4π)) P_ℓ(⟨x, μ⟩).

To estimate the density from a sample x_1, …, x_n on the sphere, the kernel density estimate is given as f̂(x) = (1/n) ∑_{i=1}^n K_τ(x; x_i), where τ is the bandwidth parameter.
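As an illustration, the circle heat kernel can be evaluated by truncating its cosine series; the sketch below (our code, with an assumed truncation length) checks that the result behaves like a density, nonnegative up to truncation error and integrating to one:

```python
import numpy as np

def circle_heat_kernel(theta, mu, tau, n_terms=60):
    """Truncated series for the heat kernel on S^1 centered at mu."""
    n = np.arange(1, n_terms + 1)[:, None]
    series = np.exp(-n**2 * tau) * np.cos(n * (theta - mu)[None, :])
    return (1.0 + 2.0 * series.sum(axis=0)) / (2.0 * np.pi)

# Periodic grid on [0, 2*pi); omitting the endpoint makes quadrature exact
theta = np.linspace(0, 2 * np.pi, 2000, endpoint=False)
dtheta = 2 * np.pi / 2000
p = circle_heat_kernel(theta, mu=np.pi, tau=0.1)

print(p.min() > -1e-9)                          # nonnegative (up to truncation)
print(abs(p.sum() * dtheta - 1.0) < 1e-6)       # integrates to one
```

Larger τ damps the high-frequency terms faster, which is exactly the coefficient decay e^{-n² τ} used by the smoothing action.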

To apply our framework, we need to find an orthonormal basis for smooth functions on the sphere. For this, we focus on the two spheres S^1 and S^2. For S^1, we use the Fourier basis: φ_0 = 1/√(2π), φ_{2k−1}(θ) = cos(kθ)/√π, and φ_{2k}(θ) = sin(kθ)/√π for k = 1, 2, …. With the Laplace operator on S^1, we have Δφ_{2k−1} = −k²φ_{2k−1} and Δφ_{2k} = −k²φ_{2k}, and thus the eigenvalue associated with frequency k is k². For S^2, we use the spherical harmonics basis as follows. Let θ and φ be the spherical coordinates (to apply the kernel in Eqn. (6), we need to represent both x and μ in spherical coordinates). The spherical harmonic of degree ℓ and order m is denoted by Y_ℓ^m, where m = −ℓ, …, ℓ. On the unit sphere S², we have ΔY_ℓ^m = −ℓ(ℓ + 1)Y_ℓ^m, and thus if we arrange the spherical harmonics in increasing order of ℓ, the corresponding eigenvalues are ℓ(ℓ + 1), each repeated 2ℓ + 1 times. With the bases established, each density can now be represented as a linear combination of the basis functions with its coefficients, and Algorithm 1 can be applied to compare densities.
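For the heat kernel on S^1, estimation and smoothing commute in coefficient space: the estimate's Fourier coefficients are the empirical trigonometric moments damped by e^{-k² τ}. A sketch under that observation (helper names are ours):

```python
import numpy as np

def kde_coeffs(samples, tau, K):
    """Fourier coefficients of the heat-kernel density estimate on S^1."""
    c = [1.0 / np.sqrt(2 * np.pi)]            # c_0 fixed by unit total mass
    for k in range(1, K + 1):
        damp = np.exp(-k**2 * tau)            # heat-kernel damping
        c.append(damp * np.mean(np.cos(k * samples)) / np.sqrt(np.pi))
        c.append(damp * np.mean(np.sin(k * samples)) / np.sqrt(np.pi))
    return np.array(c)

rng = np.random.default_rng(0)
x = rng.vonmises(np.pi, 4.0, size=500) % (2 * np.pi)   # a sample on the circle
c = kde_coeffs(x, tau=0.05, K=20)

# Reconstruct the density from the basis and check it integrates to one.
theta = np.linspace(0, 2 * np.pi, 4000, endpoint=False)
basis = [np.full_like(theta, 1 / np.sqrt(2 * np.pi))]
for k in range(1, 21):
    basis += [np.cos(k * theta) / np.sqrt(np.pi), np.sin(k * theta) / np.sqrt(np.pi)]
f = np.tensordot(c, np.array(basis), axes=1)

print(abs(f.sum() * (2 * np.pi / 4000) - 1.0) < 1e-8)  # True
```

This coefficient vector c is exactly the finite representation fed into the orbit, section, and distance computations of Section 2.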

3.2 Extension to Euclidean Domains

This method can be extended to Euclidean domains such as ℝ and ℝ² for broader applicability. Gaussian kernels satisfying the heat equation are readily available for these domains (Lafferty and Lebanon, 2005). However, there is a technical issue in that these domains are not compact. Even if we restrict to bounded intervals, there are some technical problems in directly applying the previously developed framework. Note that to represent a density function we need an orthonormal Hilbert basis with the property that each basis element is an eigenfunction of the Laplacian. In terms of such a basis, the heat equation can be solved explicitly; by flowing this solution in the time direction, we obtain an action that provides a very natural way to "spread out" Gaussian-type functions. If the domain is non-compact or has a boundary, all of this is either impossible or much more difficult (requiring a choice of boundary conditions).

In this paper, to handle data in domains such as ℝ and ℝ², we first apply a state-of-the-art kernel density estimator in the original domain and then detect the effective boundaries of the estimated densities. With these boundaries, we map the estimated domain to S^1 (or S^2), thus wrapping the estimated density onto the spherical domain. To be more specific, for ℝ, let x_1, …, x_n be a sample drawn from an unknown distribution with density function f; the Gaussian kernel density estimate is f̂(x) = (1/(nh)) ∑_{i=1}^n K((x − x_i)/h), with K the standard Gaussian kernel and h the bandwidth. We then detect the boundary of f̂ in ℝ, wrap the function onto S^1, and rescale all estimated densities to this domain. If we have multiple functions, we detect their boundaries simultaneously and select a large interval that encloses all individual boundaries as the shared boundary for all functions. Using this shared boundary, we wrap all functions onto S^1 for comparison, and we use the Fourier basis on S^1 to represent the wrapped functions.
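The wrapping step for data on the real line amounts to picking a shared interval and mapping it linearly onto [0, 2π). A minimal sketch, where the padding factor is our own choice rather than the paper's:

```python
import numpy as np

def wrap_to_circle(samples_list, pad=0.1):
    """Map all samples into [0, 2*pi) via a shared, slightly enlarged interval."""
    lo = min(s.min() for s in samples_list)
    hi = max(s.max() for s in samples_list)
    span = hi - lo
    lo, hi = lo - pad * span, hi + pad * span     # shared boundary with padding
    wrapped = [(s - lo) / (hi - lo) * 2 * np.pi for s in samples_list]
    return wrapped, (lo, hi)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 300)
y = rng.normal(0.5, 1.5, 400)
(wx, wy), (lo, hi) = wrap_to_circle([x, y])
print(wx.min() > 0 and wx.max() < 2 * np.pi)   # True
print(wy.min() > 0 and wy.max() < 2 * np.pi)   # True
```

Using one shared interval for all samples is what keeps the wrapped densities directly comparable on the circle.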

4 Two-sample Hypothesis Test

Since the geodesic distance d_ρ measures the difference between densities estimated from two samples, it is a natural statistic for a two-sample hypothesis test. Here we develop a formal testing procedure using d_ρ.

Let f̂_1 and f̂_2 be two densities estimated from samples X = {x_i} and Y = {y_j} using bandwidths τ_1 and τ_2, and let a and b denote their finite representations. The test statistic is the geodesic length between the representations of f̂_1 and f̂_2 on the section S_ρ, for a chosen ρ. Under the null hypothesis that the two samples come from the same distribution, one can simplify the test statistic by replacing the geodesic distance with the Euclidean distance (chord length) between the coefficient vectors; by Parseval's identity, this equals the L² distance between the corresponding densities. For the simplest case, following Anderson et al. (1994), the asymptotic expected value and variance of this simplified statistic can be expressed in terms of the sample sizes, the bandwidths, and the kernel function.

An asymptotic test may be based on the value of the statistic, rejecting the null hypothesis if it exceeds the appropriate critical point. However, even with the aforementioned simplification, the null distribution of the statistic is not available in closed form (Anderson et al., 1994). Furthermore, the explicit asymptotic distribution of the actual statistic d_ρ (the arc length on the ellipsoid) is even harder to obtain. Thus, a more practical approach to performing the two-sample test using d_ρ is the bootstrap method. Fixing a value of ρ for the whole experiment and letting X* and Y* denote independent re-samples drawn randomly with replacement from the pooled sample set X ∪ Y, the bootstrap approach is as follows:

  1. Calculate the geodesic distance between the kernel estimated densities from the original two samples X and Y on a chosen section S_ρ; denote it d_obs.

  2. Draw bootstrap samples X* and Y*, and calculate the geodesic distance between the kernel estimated densities from these samples on the section S_ρ. Repeat this procedure many times to obtain an empirical null distribution of the distance.

  3. Given the significance level α, reject the null hypothesis if d_obs exceeds the (1 − α) quantile of the empirical bootstrap distribution.

5 Selection of Tuning Parameter

Given any positive constant ρ, we can construct an orthogonal section S_ρ, where ρ denotes the level of smoothness. Thus, it is important to choose a proper ρ, and the corresponding section, for comparing densities.

Let us consider the scenario of comparing two densities f̂_1 and f̂_2, with finite representations a and b. Let ρ_1 = G(a) and ρ_2 = G(b), and assume ρ_1 > ρ_2. We can choose a ρ between ρ_2 and ρ_1, or even a value outside this interval, for evaluating their difference. If we choose ρ > ρ_1, we need a negative t to bring a to S_ρ. However, this process is susceptible to noise because the action is given by c_n ↦ c_n e^{−λ_n t}; for a negative t, this amounts to inflating the coefficients exponentially. Take the basis for S^1 as an example, where the eigenvalue for frequency k is k²: the factor e^{−k²t} can be very large for negative t, even when |t| is small. If there is some noise in the coefficients (which is hard to avoid in real data due to numerical errors), such noise will be amplified by this factor. Indeed, this process is called deblurring in image processing (Liu et al., 2014) and is well known to be ill-posed. Keeping this principle in mind, we propose the following strategies to select ρ for the two applications of interest:

  1. Two-sample hypothesis test. In this case, we have only two samples. We first estimate their nonparametric densities with some initial bandwidths, which can be obtained using one of the automatic bandwidth selection methods. Then, we choose the smaller of the two G-values as the smoothness level ρ for performing the hypothesis test.

  2. Comparison of multiple samples. Given a set of samples, we first use one of the automatic bandwidth selection methods to estimate their densities. If training data are available, we select ρ by cross-validation. If training data are not available, we recommend choosing ρ such that the G-values (smoothness) of most estimated densities are larger than the selected ρ.

6 Experimental Results

In this section, we demonstrate our approach on some selected domains using both simulated and real data.

6.1 Simulated Studies on ℝ and S^1

Comparing Densities Using d_ρ: We first consider the domain ℝ; densities on an interval of ℝ can be wrapped onto S^1 for analysis. We started from two densities, f_1 and f_2, shown in Figure 3 (a). We then sampled points from each and estimated densities from the samples using the kernel method (bandwidths were selected using the method given in Botev et al. (2010)), with the estimates shown in panel (b). These functions were wrapped onto the domain S^1 and represented using the Fourier basis with a finite number of coefficients. We then manipulated their smoothness levels according to the group action defined in Eqn. (2). Figure 3 columns (c) and (d) show the two estimates after we matched their smoothness levels to two different values of ρ, together with the corresponding geodesic distances between the densities at each level.

(a) True densities (b) Estimated densities (c), (d) At two matched smoothness levels
Figure 3: Illustration of using d_ρ for comparing two kernel estimated densities. (a) True densities f_1 and f_2. (b) Kernel estimated densities from their random samples. (c) and (d) The estimates after bringing them to the same smoothness level, for two different levels, together with their distances.

Utilizing Known Smoothness to Improve Estimation: In this paper, we have introduced a function G to quantify the smoothness of an estimated density function. As we know, for kernel density estimates, the bandwidth also controls the smoothness of an estimated density. Here we illustrate the connections and differences between these two ways of governing smoothness. We simulated a density function and sampled points from it. Next, we estimated the density in two different ways: (a) using the optimal bandwidth that minimizes the asymptotic MISE (Scott and Terrell, 1987); (b) first estimating the density using a smaller bandwidth and then smoothing it to the same smoothness level as that of the underlying true density. To compare these two estimates, we used the L² norm to measure the difference between each estimated density and the true density, and the results are presented in Figure 4. One can see that the errors of both methods converge to zero as the sample size increases. However, the latter solution has smaller error, and it converges at a faster rate. Notably, for small sample sizes, one can still get a very good estimate after bringing the estimated function to the correct smoothness level. The reason is that the smoothness (as quantified by the function G) is an intrinsic property of a density function. If any prior information about the smoothness of the true density is available, the proposed framework can efficiently incorporate this prior into the density estimation.

(a) True density (b) Estimation error
Figure 4: Comparison of density estimation using the optimal bandwidth and optimal smoothness (evaluated by the G-value).

d_ρ as Test Statistic for the Two-sample Test: We are interested in using d_ρ as a statistic for two-sample hypothesis testing and, further, in investigating the effect of ρ on the power of that test. We performed an experiment in which we simulated five pairs of densities with similar smoothness levels, where the L² distance between the two densities in a pair increases from pair 1 (essentially identical densities) to pair 5, and sampled points from each density. The initial estimates were formed using the automatic bandwidth selection method given in Botev et al. (2010). We then brought them to different pre-specified smoothness levels for testing. Figure 5 (a) shows the results, where the x-axis is the smoothness level ρ, and the y-axis shows the percentage (over repeated tests) of rejecting the null hypothesis. From the results, we can see that the selection of ρ is important. A large ρ smooths the estimated densities too much and, therefore, eliminates their difference and reduces the power of the test. Fortunately, over a relatively large range of ρ, we obtain very good test performance. In panel (b), we show the histogram of the smoothness levels of the estimated densities (over the 500 runs) obtained using the automatic bandwidth selection method (Botev et al., 2010) for each pair of simulated functions. Following the second procedure for selecting ρ in Section 5, we can select a single ρ for all five pairs for the hypothesis testing. This results in good performance: a small type I error (see pair 1), and good test power (see pairs 2, 3, and 4).

(a) Hypothesis test performance (b) Histogram of smoothness levels
Figure 5: Effects of the smoothness level on test performance. (a) Smoothness level on the x-axis and percentage of rejections of the null hypothesis on the y-axis. (b) Distributions of the smoothness levels of the estimated densities obtained using Botev et al. (2010) in the simulation.
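A permutation-based version of such a test can be sketched as follows. This is a simplified stand-in rather than the paper's procedure: it uses a fixed-bandwidth L2 statistic between kernel estimates on a grid instead of the geodesic distance on a section, and the grid padding, permutation count, and bandwidth are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import gaussian_kde

def l2_stat(xs, ys, grid, bw):
    """L2 distance between fixed-bandwidth density estimates of two samples."""
    fx = gaussian_kde(xs, bw_method=bw)(grid)
    fy = gaussian_kde(ys, bw_method=bw)(grid)
    return np.sqrt(np.sum((fx - fy) ** 2) * (grid[1] - grid[0]))

def two_sample_test(xs, ys, n_perm=200, bw=0.3, seed=0):
    """Permutation approximation to the null distribution of the statistic."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(min(xs.min(), ys.min()) - 1.0,
                       max(xs.max(), ys.max()) + 1.0, 256)
    obs = l2_stat(xs, ys, grid, bw)
    pooled = np.concatenate([xs, ys])
    null = []
    for _ in range(n_perm):
        perm = rng.permutation(pooled)         # shuffle group labels
        null.append(l2_stat(perm[:xs.size], perm[xs.size:], grid, bw))
    p_value = np.mean(np.array(null) >= obs)   # one-sided: large distance rejects
    return obs, p_value

rng = np.random.default_rng(1)
same_p = two_sample_test(rng.normal(0, 1, 150), rng.normal(0, 1, 150))[1]
diff_p = two_sample_test(rng.normal(0, 1, 150), rng.normal(1.5, 1, 150))[1]
print(same_p, diff_p)
```

With a large mean shift the observed statistic falls far in the tail of the permutation distribution, so the second p-value is essentially zero, while for samples from the same distribution the p-value is unremarkable.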

We also performed an extensive experiment to compare our results with other two-sample test methods. We consider two different scenarios: (1) a case where the difference between the two densities lies in the tails; and (2) a case where the difference lies in the middle. Figure 6, column (a), shows the simulated densities (the first row shows scenario (1) and the second row shows scenario (2)). We sampled points from each density and used them for the testing. We compared with four other tests: (i) the Kolmogorov-Smirnov (KS) test; tests based on the L2 distance between estimated densities using (ii) a unit bandwidth (Fix BD) (Anderson et al., 1994) and (iii) the optimal bandwidth (Opt BD) (Botev et al., 2010); and (iv) the maximum mean discrepancy (MMD) method (Gretton et al., 2012, 2007) (a Gaussian kernel with the bandwidth recommended by Gretton et al. (2012) was used). The smoothness level in our method was held fixed. In scenario (1), our method outperforms the KS, Fix BD, and Opt BD statistics. When the difference between the two densities is small, our method has a smaller chance of rejecting the null hypothesis (a lower type I error) compared to MMD and, when the difference is large, our method has a bigger chance of rejecting the null hypothesis (a higher test power) relative to MMD. In scenario (2), the proposed method has test power similar to MMD, but outperforms KS, Fix BD, and Opt BD.

(a) True densities (b) Test performance
Figure 6: Two-sample hypothesis test. (a) shows one example of the true densities. (b) shows the test performance.
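For reference, the MMD competitor is easy to reproduce. The sketch below implements the unbiased squared-MMD statistic with a Gaussian kernel and the median heuristic for the kernel bandwidth, following Gretton et al. (2012), on one-dimensional toy samples; the sample sizes and distributions are illustrative.

```python
import numpy as np

def mmd2_unbiased(x, y, sigma):
    """Unbiased estimate of squared MMD with a Gaussian kernel (Gretton et al., 2012)."""
    def k(a, b):
        d2 = (a[:, None] - b[None, :]) ** 2
        return np.exp(-d2 / (2.0 * sigma ** 2))
    m, n = x.size, y.size
    kxx = k(x, x); np.fill_diagonal(kxx, 0.0)   # drop i == j terms
    kyy = k(y, y); np.fill_diagonal(kyy, 0.0)
    kxy = k(x, y)
    return kxx.sum() / (m * (m - 1)) + kyy.sum() / (n * (n - 1)) - 2.0 * kxy.mean()

def median_heuristic(x, y):
    """Median of pairwise distances of the pooled sample, a common bandwidth choice."""
    z = np.concatenate([x, y])
    d = np.abs(z[:, None] - z[None, :])
    return np.median(d[d > 0])

rng = np.random.default_rng(2)
x, y = rng.normal(0, 1, 200), rng.normal(0.8, 1, 200)
sigma = median_heuristic(x, y)
print(mmd2_unbiased(x, y, sigma))   # clearly positive for shifted samples
```

In a full test, the statistic would be compared against a permutation null, exactly as in the L2-based test above.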

6.2 Real Data Application

The geodesic distance is not only a statistic for two-sample hypothesis testing but also a metric for quantifying differences between non-parametric densities. One potential application is in computer vision, where features are extracted and compared via their distributions (Liu and Wang, 2003; Chaudhry et al., 2009; Osada et al., 2002), e.g., histograms. However, the choice of the number of bins has a large influence on the shapes of histograms. Instead of comparing histograms, a better way of comparing two features is to compare their kernel estimated densities, especially under the proposed geodesic distance.

(a) Original texture images (b) Features (c) Kernel estimated densities
Figure 7: Comparison of features using the geodesic distance. The estimated densities in (c) were brought to the same smoothness level.

We performed an image classification experiment to illustrate the advantages of the geodesic distance in comparing image features. According to Liu and Wang (2003), spectral histograms, which are simply the marginal distributions of an image after convolution with a set of filters, can be used to represent and classify texture images. Motivated by this argument, we convolved each texture image with Gabor filters (Liu and Wang, 2003), and the corresponding kernel estimated marginal distributions were computed and used as features to classify the texture images. Figure 7 illustrates one example of comparing two texture images in this way. Our classification dataset contains texture images from six different categories: leaves, food, fabric, buildings, brick, and bark. Each category has 9 images. Some examples of these images are shown in Figure 8. In the classification experiment, we chose these images one by one as queries and found the nearest neighbor of each under the metric mentioned above. If the nearest neighbor belongs to the same category as the query image, we count the retrieval as successful, and otherwise as failed. We performed this process for every image in the dataset. Table 1 shows the classification results; the numbers in the table are the number of successful retrievals for each category. We compared our method with five other similarity measures. Given two feature histograms and the corresponding estimated densities for a pair of images, these similarity measures are:

  1. Hist. (20): distance between histograms with 20 bins.

  2. Hist. (100): distance between histograms with 100 bins.

  3. Distance between the estimated densities.

  4. A second distance between the estimated densities.

  5. Bhatt.: the Bhattacharyya distance between the estimated densities.
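The bin-sensitivity issue and the density-based alternatives can be illustrated on toy one-dimensional features. The samples below are illustrative stand-ins for image feature values; the Bhattacharyya distance uses its standard form, the negative log of the Bhattacharyya coefficient ∫ sqrt(f g).

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, 300)   # feature values from image A (toy stand-in)
b = rng.normal(0.5, 1.2, 300)   # feature values from image B

grid = np.linspace(-6.0, 8.0, 400)
dx = grid[1] - grid[0]
fa, fb = gaussian_kde(a)(grid), gaussian_kde(b)(grid)

def hist_l2(x, y, bins):
    """L2 distance between normalized histograms; the value depends on the bin count."""
    lo, hi = grid[0], grid[-1]
    hx, _ = np.histogram(x, bins=bins, range=(lo, hi), density=True)
    hy, _ = np.histogram(y, bins=bins, range=(lo, hi), density=True)
    return np.sqrt(np.sum((hx - hy) ** 2) * (hi - lo) / bins)

# Bhattacharyya distance between the estimated densities
bc = np.sum(np.sqrt(fa * fb)) * dx      # Bhattacharyya coefficient, in (0, 1]
bhatt = -np.log(bc)

print(hist_l2(a, b, 20), hist_l2(a, b, 100), bhatt)
```

The two histogram distances generally disagree, which is exactly the sensitivity to binning that motivates comparing smoothed density estimates instead.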

The kernel densities in this experiment were estimated with the automated bandwidth selection method of Botev et al. (2010). For each retrieval, we used a different smoothness level, selected based on the strategy presented in Section 5 (assuming that no training data are available). From these results, we can see that the proposed method outperforms the compared similarity measures. Since this classification is based only on these marginal-distribution features, one can potentially improve the result by adding more features.
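The retrieval protocol itself is straightforward; a minimal sketch follows, with a toy two-class feature set and Euclidean distances standing in for the texture features and the geodesic distance.

```python
import numpy as np

def retrieval_accuracy(dist, labels):
    """Leave-one-out 1-NN retrieval: for each item, find its nearest neighbor
    under the given distance matrix and check whether the labels match."""
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)            # exclude the query itself
    nn = d.argmin(axis=1)
    return float(np.mean(labels[nn] == labels))

# toy data: two well-separated clusters of 9 "images" each (mirroring the
# 9-images-per-category setup of the experiment)
rng = np.random.default_rng(5)
feats = np.concatenate([rng.normal(0.0, 0.3, (9, 4)),
                        rng.normal(3.0, 0.3, (9, 4))])
labels = np.array([0] * 9 + [1] * 9)
dist = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
print(retrieval_accuracy(dist, labels))    # well-separated clusters: near 1.0
```

In the paper's experiment, `dist` would instead hold pairwise geodesic distances between the smoothness-matched density estimates of the images.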

Figure 8: Example images in texture classification dataset.
Categories     Hist. (20)  Hist. (100)  Measure (3)  Measure (4)  Bhatt.      Proposed
Leaves (9)     3           3            5            5            4           9
Food (9)       6           6            6            7            6           7
Fabric (9)     2           2            3            3            3           3
Buildings (9)  5           6            6            7            7           7
Brick (9)      3           3            5            5            4           5
Bark (9)       4           4            5            5             3          5
Total (%)      23 (42.6)   24 (44.4)    30 (55.6)    32 (59.3)    27 (50.0)   36 (66.7)
Table 1: Classification results for the texture images (number of successful retrievals out of 9 per category; the last column is the proposed geodesic distance).

6.3 Simulation Studies on S2

Now we consider the unit two-sphere S2 as the domain of interest. We first compare the kernel densities estimated from different random samples. We drew two sets of samples from two different mixtures of von Mises-Fisher distributions. The heat kernel on S2 in Eqn. (6) was used to estimate densities. Note that the summation in Eqn. (6) has to be truncated in practice; only a large but finite number of leading terms were kept to approximate the Gaussian kernel. In the simulation, the sphere was parametrized using a grid, and each kernel estimated density was fitted using a spherical harmonic basis up to a fixed degree. Since the data were simulated from smooth distributions, these spherical harmonic functions are enough to represent the estimated density; for a rougher function, more basis elements become necessary. The left panel of Figure 9 shows a true density function that we used to sample data, together with estimates under different bandwidths. Next, we compare the proposed distance with the Fisher-Rao metric between densities on S2. The results are shown in Table 2; a fixed smoothness level was used for our approach. This experiment shows that the proposed distance is almost constant and unaffected by the selected bandwidth, in contrast to the Fisher-Rao distance, which changes significantly with the bandwidth used.
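The truncated heat-kernel estimator can be sketched via the Legendre expansion of the spherical heat kernel, K_t(x, y) = Σ_l (2l+1)/(4π) e^{-l(l+1)t} P_l(⟨x, y⟩). The truncation level `L`, bandwidth `t`, and the toy sample below are illustrative choices, and the clustered-Gaussian sample is only a stand-in for a von Mises-Fisher draw.

```python
import numpy as np
from scipy.special import eval_legendre

def heat_kernel_s2(cos_angle, t, L=50):
    """Truncated heat kernel on S^2:
    K_t(x, y) = sum_{l=0}^{L} (2l+1)/(4*pi) * exp(-l*(l+1)*t) * P_l(<x, y>)."""
    k = np.zeros_like(np.asarray(cos_angle, dtype=float))
    for l in range(L + 1):
        k += (2 * l + 1) / (4 * np.pi) * np.exp(-l * (l + 1) * t) \
             * eval_legendre(l, cos_angle)
    return k

def kde_s2(query, samples, t, L=50):
    """Heat-kernel density estimate at `query` points (rows are unit vectors)."""
    cos = query @ samples.T        # pairwise inner products = cos(geodesic angle)
    return heat_kernel_s2(cos, t, L).mean(axis=1)

# toy sample: points clustered around the north pole
rng = np.random.default_rng(4)
v = rng.normal(0, 1, (200, 3)) + np.array([0.0, 0.0, 4.0])
samples = v / np.linalg.norm(v, axis=1, keepdims=True)

query = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]])   # pole vs. antipode
dens = kde_s2(query, samples, t=0.05)
print(dens)   # density near the cluster should far exceed density at the antipode
```

Because each Legendre term is damped by e^{-l(l+1)t}, the truncation error is negligible once L is moderately large relative to 1/t, which is why only finitely many terms are needed in practice.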

Figure 9: Left: Density estimation on S2 under different bandwidths. Right: Two-sample hypothesis test on S2.
                              Proposed distance                   Fisher-Rao
Bandwidth (second density):   0.05    0.1     0.15    0.2         0.05    0.1     0.15    0.2
Bandwidth (first density)
0.05                          0.1470  0.1478  0.1485  0.1492      0.4485  0.3857  0.3937  0.4275
0.1                           0.1453  0.1461  0.1467  0.1475      0.4727  0.3466  0.2998  0.3015
0.15                          0.1440  0.1447  0.1453  0.1459      0.5218  0.3699  0.2847  0.2489
0.2                           0.1423  0.1429  0.1435  0.1439      0.5695  0.4089  0.3045  0.2413
Table 2: Comparison of the proposed distance with the Fisher-Rao distance under different bandwidth combinations.

Next, we used the proposed distance to perform two-sample hypothesis testing on S2. We simulated pairs of density functions on S2 with increasing distances and sampled data points from each of them. The bandwidth for density estimation can be selected using the data driven method of Klemelä (2000). The test results are shown in the right panel of Figure 9. We selected a constant smoothness level in this experiment (as in the earlier one-dimensional cases). We only compared against the L2 metric between estimated densities, since the other testing methods (e.g., KS and MMD) are not directly applicable to data on S2.

6.4 Real Data Application on S2

An interesting application of our approach on S2 is the analysis of hurricane data to study hurricane patterns. In the results reported here, we used the Atlantic hurricane database (HURDAT2) (Landsea et al., 2015), which contains hurricanes originating in the North Atlantic Ocean and the Gulf of Mexico. The database contains six-hourly information on the location, maximum winds, central pressure, and so on, for each of the relevant hurricanes.

First, we are interested in analyzing the location distributions of hurricanes starting from two different regions. In Figure 10, panel (a) shows two sets of hurricanes grouped by their starting locations (in different colors), panel (b) shows the locations of these hurricanes after a fixed number of hours, and panel (c) shows their ending points. Using the proposed distance, we can measure the difference between the location distributions of these two sets of hurricanes after a given period of development. We can also perform a two-sample hypothesis test of whether the hurricane location distributions differ after that period. We randomly chose three pairs of sets of hurricanes and calculated the distance for each pair. Table 3 shows the results, where “1” represents rejecting the null hypothesis and “0” represents failing to reject it (based on a pre-specified significance level). All distances and two-sample hypothesis tests were computed on the same section. From the table we can see that the short-term evolution of hurricanes depends on their starting points; however, as the time lag increases the dependence naturally decreases, and the distribution eventually does not depend on the initial locations (e.g., the first and second pairs). However, when the initial locations are markedly different, the ending points also remain discriminative (e.g., the third pair). The p-values of the two-sample tests for the three pairs at the ending stage are consistent with these decisions.

Next, we divided the hurricanes starting from the Gulf of Mexico into two categories: (1) hurricanes starting in [May, August] and (2) hurricanes starting in [September, December], and analyzed their trajectories. Figure 11 shows these two categories in yellow and green, respectively: panel (a) shows the starting points, panel (b) shows the locations after a fixed number of hours, and panel (c) shows the ending points. Our results indicate that the distributions of the starting points of these two sets of hurricanes show no statistical difference. However, after the intermediate period of development their distributions are significantly different, and the distributions of their ending points are also different. From Figure 11, we can see that most hurricanes proceed along the east coast of America, and that hurricanes in [May, August] in general spread out faster and farther than hurricanes in [September, December].

(a) Starting points (b) After a fixed number of hours (c) Ending points
Figure 10: Location distributions of hurricanes starting from different regions.
Temporal info Starting 6-hour 12-hour 18-hour 24-hour 30-hour 60-hour Ending
First pair 0.1574 0.1527 0.1476 0.1423 0.1370 0.1318 0.1111 0.0341
Test 1 1 1 1 1 1 1 0
Second pair 0.2331 0.2253 0.2176 0.2095 0.2020 0.1948 0.1682 0.0416
Test 1 1 1 1 1 1 1 0
Third pair 0.4512 0.4403 0.4277 0.4169 0.4035 0.3915 0.3419 0.1417
Test 1 1 1 1 1 1 1 1
Table 3: Comparison of hurricanes starting at different locations, computed on the chosen section (1 - reject the null hypothesis; 0 - fail to reject the null hypothesis).
(a) Starting points (b) After a fixed number of hours (c) Ending points
Figure 11: Distribution of two sets of hurricanes in the Gulf of Mexico. Hurricanes between [May, August] are marked in yellow and hurricanes between [September, December] are marked in green.

7 Summary

We have introduced a framework for metric-based comparison of densities estimated using an isotropic Gaussian kernel. This comparison is based on quantifying the smoothness levels of density functions and bringing them to the same level before performing comparisons. The quantification and manipulation of the smoothness levels of pdfs are built on the action of a smoothing group on the space of functions. This action is implemented via the heat equation, whose solutions correspond to isotropic Gaussian smoothing of an initial function. A section of this action is the set of all functions that have the same level of smoothness, and this set can be identified with an ellipsoid. Geodesic distances on this ellipsoid provide a measure for comparing estimated densities. We use this framework to derive a two-sample hypothesis test with the geodesic distance as a test statistic and a bootstrap method for approximating the distribution of this test statistic. Through a variety of experiments involving both real and simulated data, we tested the validity of this approach on several domains, including the unit circle, the unit interval, and the two-dimensional unit sphere. We observe that bringing estimated densities to the same smoothness level reduces the effect of bandwidth and/or sample size on density comparisons and significantly improves the test results.

8 Appendix

8.1 Proof of the Orthogonal Section Property

A subset of the coefficient representations of densities is an orthogonal section under the action of the group (defined in Eqn. (3) in the main paper) if: (i) one and only one element of every orbit is present in