Nowadays, many statistical applications study non-Euclidean data. Typical examples include symmetric positive definite (SPD) matrices (Smith et al., 2013), the Grassmann manifold (Hong et al., 2016), the shape representation of the corpus callosum (Cornea et al., 2017), samples of probability density functions in Wasserstein spaces (Petersen et al., 2021), and palaeomagnetic directional data on spheres (Scealy and Wood, 2019). A common strategy is to embed non-Euclidean data objects into a Euclidean or metric space before the analysis. When the non-Euclidean data objects can be embedded in a metric space, but not a Euclidean space, metric (distance)-based methods can then be applied. Many methods exist or are being developed in statistics and machine learning; e.g., Friedman and Rafsky (1983); Székely and Rizzo (2004); Székely et al. (2007); Heller et al. (2012); Balakrishnan et al. (2013); Lyons (2013); Chen and Friedman (2017); Pan et al. (2018); Shi et al. (2018); Dubey and Müller (2019). Assessing the uncertainty that follows from using these methods to analyze non-Euclidean data is important, but a theoretical foundation for such statistical inference does not yet exist. For example, does a function of the metric play a central role in metric-based inference, similar to that of the distribution function (DF) in Euclidean spaces?
In statistical inference, the DF relates theory to the real world, allowing us to draw conclusions from the data (Efron, 1979). The DF uniquely determines the Borel probability measure of a random vector (or a scalar) according to the correspondence theorem (Halmos, 1956). Given observed data, the DF can be well estimated by the empirical distribution function (EDF). As illustrated in Figure 1, the DF and the observed sample are linked into a directed closed loop by the correspondence theorem in measure theory and the Glivenko-Cantelli theorem in statistical inference. This connection creates a paradigm for statistical inference.
Studying the properties of the EDF, and more generally of the empirical process, has been a major area of mathematical statistics (Shorack and Wellner, 2009), because many statistical procedures can be represented as functionals of the EDF. Examples include the Kolmogorov-Smirnov test (Darling, 1957) for the equality of two unknown DFs and Hoeffding’s independence test (Hoeffding, 1948) for two random samples of data.
In this paper, we introduce a quasi-DF to serve as the cornerstone of nonparametric statistical inference for metric space-valued data objects. We consider several important problems in statistical inference to show the utility of the quasi-DF. Note that the DF in a Euclidean space has the correspondence theorem because its definition is closely related to the Euclidean metric topology. Indeed, the DF is the Borel probability measure of the Cartesian products of left closed rays, which form a base of the Euclidean metric topology. A metric space is equipped with a natural metric topology that has the open balls as a base. Balls are generally not ordered, but concentric balls are. This ordering is essential: once we fix the center, it lets us define DFs in the same way as the order topology of the one-dimensional Euclidean space does. Using this center as the second variable, we can define the metric distribution function (MDF) in metric spaces as the counterpart of the DF in Euclidean spaces (Figure 1).
The rest of this article is organized as follows. We introduce the concepts of the MDF and the empirical MDF (EMDF) in Section 2, and present their theoretical properties in Section 3. In Section 4, on the basis of the MDF and the EMDF, we develop several nonparametric statistical inference procedures and statistical learning methods. To demonstrate the MDF’s effectiveness in practice, we apply the MDF-based methods to synthetic and real-world datasets in Sections 5 and 6, respectively. Finally, we summarize our work on the MDF in Section 7. Technical proofs and some properties of the EMDF are deferred to the “Supplementary Material”.
2 Metric distribution function and empirical metric distribution function
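An ordered pair $(\mathcal{M}, d)$ is called a metric space if $\mathcal{M}$ is a set and $d$ is a metric or distance on $\mathcal{M}$. Many spaces we have encountered are metric spaces. Examples include Euclidean spaces, Banach spaces and connected Riemannian manifolds. A metric space is called separable if it has a countable dense subset for the metric topology. A metric space $(\mathcal{M}, d)$ is said to be complete if every Cauchy sequence converges in $\mathcal{M}$. A complete separable metric space is sometimes called a Polish space. Given a metric space $(\mathcal{M}, d)$, let $\bar{B}(u, r) = \{v \in \mathcal{M} : d(u, v) \le r\}$ be the closed ball with center $u$ and radius $r \ge 0$, $B(u, r) = \{v \in \mathcal{M} : d(u, v) < r\}$ be the open ball and $\partial B(u, r) = \{v \in \mathcal{M} : d(u, v) = r\}$ be the sphere.
If $(\mathcal{M}_k, d_k)$, $k = 1, \dots, K$, are metric spaces, let $\mathcal{M}$ be the Cartesian product of $\mathcal{M}_1, \dots, \mathcal{M}_K$, denoted by $\mathcal{M} = \prod_{k=1}^{K} \mathcal{M}_k$. For any $u = (u_1, \dots, u_K)$ and $v = (v_1, \dots, v_K)$ in $\mathcal{M}$, we can define a metric vector on the product space $\mathcal{M}$:
$$\mathbf{d}(u, v) = \big(d_1(u_1, v_1), \dots, d_K(u_K, v_K)\big).$$
We also define $\bar{B}(u, \mathbf{r}) = \prod_{k=1}^{K} \{x_k \in \mathcal{M}_k : d_k(u_k, x_k) \le r_k\}$ to be the joint ball on the product space for a center vector $u = (u_1, \dots, u_K)$ and a non-negative radius vector $\mathbf{r} = (r_1, \dots, r_K)$. For this product space, we can also assign a metric such that it is a metric space. For example, if we define
$$d(u, v) = \|\mathbf{d}(u, v)\|_q,$$
where $q \in [1, \infty]$ and $\|\cdot\|_q$ means the $L_q$ norm in $\mathbb{R}^K$, we can verify that $d$ is a metric on $\mathcal{M}$.
Given a point $u = (u_1, \dots, u_K) \in \mathcal{M}$, $\pi_k$ is called the projection on $\mathcal{M}_k$ if $\pi_k(u) = u_k$. For a set $A \subset \mathcal{M}$, we also define $\pi_k(A) = \{\pi_k(u) : u \in A\}$.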
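As a small numerical illustration of the product-space construction, the sketch below assumes that the metric vector collects the componentwise distances and that the product metric is a $q$-norm of that vector. The function names (`metric_vector`, `product_metric`, `in_joint_ball`) and the toy component metric are hypothetical and chosen only for this example.

```python
import numpy as np

def metric_vector(u, v, metrics):
    """Componentwise distances (d_1(u_1, v_1), ..., d_K(u_K, v_K))."""
    return np.array([d(uk, vk) for d, uk, vk in zip(metrics, u, v)], dtype=float)

def product_metric(u, v, metrics, q=2.0):
    """q-norm of the metric vector; q = np.inf gives the max metric."""
    return float(np.linalg.norm(metric_vector(u, v, metrics), ord=q))

def in_joint_ball(x, center, radii, metrics):
    """True if every component x_k is within radius r_k of center_k."""
    gaps = metric_vector(center, x, metrics)
    return bool(np.all(gaps <= np.asarray(radii, dtype=float)))

def abs_dist(a, b):
    """Toy component metric: absolute difference on the real line."""
    return abs(a - b)

metrics = [abs_dist, abs_dist]          # K = 2 copies of the real line
u, v = (0.0, 0.0), (3.0, 4.0)
print(metric_vector(u, v, metrics))     # componentwise distances 3 and 4
print(product_metric(u, v, metrics, q=1),
      product_metric(u, v, metrics, q=2),
      product_metric(u, v, metrics, q=np.inf))  # 7.0 5.0 4.0
```

Note that the joint ball is a product of component balls, so membership is checked componentwise rather than through the single product metric.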
2.2 Metric distribution function and empirical metric distribution function
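Let $\mu$ be a (Borel) probability measure associated with an ordered $K$-tuple of random objects $X$ taking values in the product space $\mathcal{M} = \prod_{k=1}^{K} \mathcal{M}_k$, where each $\mathcal{M}_k$ carries a metric $d_k$. Denote the indicator function by $I(\cdot)$ and the radius vector by $\mathbf{d}(u, v) = \big(d_1(u_1, v_1), \dots, d_K(u_K, v_K)\big)$ for $u, v \in \mathcal{M}$. We first define the metric distribution function (MDF) of $\mu$ on $\mathcal{M}$, which is the foundation of our proposed framework. For $u, v, x \in \mathcal{M}$, let
$$\delta(u, v, x) = \prod_{k=1}^{K} I\{d_k(u_k, x_k) \le d_k(u_k, v_k)\}.$$
Given a probability measure $\mu$, we define the metric distribution function of $\mu$ on $\mathcal{M}$: for all $u, v \in \mathcal{M}$,
$$F_{\mu}^{M}(u, v) = \mu\{x \in \mathcal{M} : d_k(u_k, x_k) \le d_k(u_k, v_k),\ k = 1, \dots, K\} = E[\delta(u, v, X)].$$
Suppose that $X_1, \dots, X_n$ are samples generated from a probability measure $\mu$ on a product metric space $\mathcal{M}$. We naturally define the empirical metric distribution function (EMDF) associated with $\mu$ by
$$F_{\mu, n}^{M}(u, v) = \frac{1}{n} \sum_{i=1}^{n} \delta(u, v, X_i).$$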
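In practice the EMDF only requires pairwise distances. The following is a minimal sketch, assuming that the EMDF at a pair $(u, v)$ is the fraction of sample points that fall, in every component, inside the closed ball centered at $u$ whose radius is the distance from $u$ to $v$. The function name `emdf_matrix` and the toy data are illustrative assumptions, not part of the original text.

```python
import numpy as np

def emdf_matrix(dist_mats):
    """EMDF evaluated at all sample pairs (X_i, X_j).

    dist_mats is a list of K arrays of shape (n, n); the k-th array holds
    the pairwise distances of the k-th components of the sample.
    Entry (i, j) of the result is the fraction of sample points X_t with
    d_k(X_i, X_t) <= d_k(X_i, X_j) for every component k.
    """
    inside = None
    for D in dist_mats:
        D = np.asarray(D, dtype=float)
        # ball_k[i, j, t]: X_t lies in the closed ball centred at X_i
        # with radius d_k(X_i, X_j) in component k.
        ball_k = D[:, None, :] <= D[:, :, None]
        inside = ball_k if inside is None else inside & ball_k
    return inside.mean(axis=2)

# Toy sample on the real line with the absolute-value metric (K = 1).
x = np.array([0.0, 1.0, 3.0, 6.0])
D = np.abs(x[:, None] - x[None, :])
F = emdf_matrix([D])
print(F)
```

The intermediate boolean array has $n^3$ entries, which is convenient for small samples; a sorted or rank-based computation would be preferable for large $n$.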
3 Theoretical analysis of metric distribution function and empirical metric distribution function
In this section, we first discuss some sufficient conditions for reconstructing probability measures from the MDFs and exhibit the properties of the convergence of the EMDFs. Additional basic properties of the EMDF are presented in the second part of the “Supplementary Material.”
3.1 Fundamental reconstruction theorems of MDF
Here we investigate whether a Borel probability measure on a separable metric space can be uniquely determined by the MDF . We shall see that the answer depends on both the probability measure and metric space. For separable metric spaces, Federer (2014) introduced the following geometrical condition on the metric, named directionally -limited, to characterize the correspondence property of the MDF.
Definition 2 (Federer (2014)).
A metric is called directionally -limited at the subset of , if , , is a positive integer and the following condition holds: if for each , such that whenever (), with
then the cardinality of is no larger than .
This concept of “directionally -limited” is essential to our reconstruction theory. We illustrate this condition in Figure 2. We examine a few examples to understand the implication of this condition. First, if with the norm is a Banach space, then the above definition implies
thus is equivalent to
If is a finite dimensional Banach space, owing to the compactness of the unit sphere in , there exists a suitable for each such that the condition of directionally limited metric space holds, which is illustrated in Figure 2.
Another interesting special case is when is a Riemannian manifold (of class ) and is any compact subset of . Let be a normal ball of and be the exponential map, such that the restriction of to the ball of radius has the Lipschitz constant , and , then
Thus, if we associate each with the direction of the shortest geodesic from to , then by the compactness of the set of all unit tangent vectors at the points of there exists a suitable for each .
The last but important case is when is the metric space of binary phylogenetic tree with leaves (Billera et al., 2001), where is fixed. The space is a Polish space and cubical complex (Lin and Müller, 2019). Let be the -packing of such that, for any , the geodesic distance . Denote , for and , the space satisfies
for some constant . This implies that the whole space is directionally-limited with .
Next, we provide an example of metric space that is not directionally limited. An infinite orthonormal base in a separable Hilbert space is not directionally -limited. Let , and , then by the above discussion for Banach space, we have
for all and the cardinality of is infinite.
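The computation behind this example can be made explicit. As a hedged sketch, assume the center is the origin and that the Banach-space form of the condition compares the unit directions from the center to the points of the set (the displayed formula is not reproduced here, so this reading is an assumption). For two distinct orthonormal vectors $e_i$ and $e_j$:

```latex
\|e_i - e_j\|^2
  = \langle e_i, e_i \rangle - 2\langle e_i, e_j \rangle + \langle e_j, e_j \rangle
  = 1 - 0 + 1 = 2,
\qquad\text{so}\qquad
\|e_i - e_j\| = \sqrt{2} \quad (i \neq j).
```

Every pair of basis directions is therefore separated by the same constant $\sqrt{2}$, however many basis vectors are taken, so no finite bound on the cardinality of such a set can hold.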
In a Euclidean space, two Borel probability measures if and only if their associated random objects and share the common DF by the correspondence theorem (Halmos, 1956). This correspondence lays the theoretical foundation for statistical inference. However, DF depends on the linear structure and the order of real numbers. We do not have this structure in a general metric space, and DF can no longer be defined. The following theorems delineate how MDF overcomes this major challenge. Theorem 1 shows that and share the same DF for each location if and only if .
Theorem 1 (The fundamental correspondence theorem of MDF).
Suppose that is a Polish space and denote for two given Borel probability measures and with their respective supports and on . Then with (or ) implies if the metric is directionally -limited at and .
Theorem 1 ensures that the MDF has a one-to-one correspondence with a probability measure when the metric is directionally -limited at the support set of the probability measure. The conditions of Theorem 1 may not be satisfied if is a separable Hilbert space of infinite dimension. Corollary 1 presents reasonable conditions on the measure or on the space so that the probability measure can still be determined by the MDF in infinite-dimensional spaces.
(a) Measure Condition: For , there exists such that (or ) and the metric is directionally -limited at . Then with (or ) implies .
(b) Metric Condition: There exist which are the nondecreasing subsets of , where each is a Polish space satisfying the directionally-limited condition and their closure . For every , is unique such that and if . Then with (or ) implies .
Corollary 1 includes separable Hilbert spaces as a special case. For example, a random function, or a random curve in a separable Banach space with unconditional Schauder base functions , can be expanded as where the probability measure of the coefficients , denoted as , satisfies a sparse condition . This setting is similar to the sparse priors used in Castillo et al. (2015) and O’Hara et al. (2009), but it is important to note that we allow the underlying function space to be infinite dimensional. In this example, the “measure condition” of Corollary 1 is satisfied. The metric condition of Corollary 1 implies that if and are two Borel probability measures on , and they share the common metric distribution function on any finite subspace of , then we have . According to the previous statement, many metric spaces, including the space of smooth functions, Riemannian manifold space, shape space, -leaves binary phylogenetic spaces (see Figure 1), satisfy the conditions of Corollary 1.
In general, if the metric space is not a linear space, the geometric condition of “directionally -limited” cannot be deduced from compactness. Davies (1971) gives a counterexample: there exist a compact metric space and two distinct Borel probability measures and on , such that and agree on all closed balls.
Next, we extend the 1-1 correspondence theorem to product metric spaces. This extension is challenging because the topological structure of a product metric space may not be as simple as that of a Euclidean space. For example, the product of two circles is topologically not a sphere anymore. Let be two Borel probability measures on , and
We have the following fundamental reconstruction theorem of the joint metric distribution function in a product metric space.
Theorem 2 (The fundamental reconstruction theorem of joint MDF).
Given two Borel probability measures and on a product Polish space , let . If and are both convex combinations of elements in and , then with (or ) implies if is directionally-limited at .
Similar to Corollary 1, we have the following Corollary for the product space.
(a) Measure Condition: For , there exists such that (or ) and the metric is directionally -limited at .
If and are both convex combinations of elements in and , then with (or ) implies .
(b) Metric Condition: There exist which are the non-decreasing subsets of , where each is a Polish space satisfying the directionally-limited condition and . For every , is unique such that and if . If and are both convex combinations of elements in and , then with (or ) implies .
3.2 Main properties of EMDF
Here, we provide EMDF’s Glivenko-Cantelli property and Donsker property. First, we define the collection of the indicator functions of closed balls on :
The uniform convergence property of EMDF is given as follows.
Theorem 3 (The Glivenko-Cantelli type property of EMDF).
Let be a product space and be a probability measure on it. Suppose that is a sample of i.i.d observations from . Define . If satisfies that
where is the cardinality of a set, we have the Glivenko-Cantelli property of our empirical metric distribution function:
The conditions of Theorem 3 are often satisfied in practice. The first example is with the -norm () and is an arbitrary probability measure. The second example is that includes all smooth regular curves in the Euclidean space or a sphere in with the geodesic distance and is an arbitrary probability measure. The third example is that is a set of polygonal curves in with the Hausdorff distance or the Fréchet distance (Driemel et al., 2019) and is an arbitrary probability measure. Another example is that is a separable Hilbert space with a probability measure with support on a finite dimensional subspace.
Based on the two reconstruction theorems, whether two probability measures are identical depends on whether their MDFs are the same over their support sets only but not the whole space. This leads us to consider the Glivenko-Cantelli type property for the MDF over the sample set, because the sample set contains the information of the support of the underlying unknown probability measure.
Theorem 4 (A concentration inequality of EMDF).
Let be a product space and be a probability measure on it. For each , there exists a universal constant such that for all , we have
which leads to
Without restriction on metric spaces and probability measures, Theorem 4 shows that the EMDF concentrates at an exponential rate for a sufficiently large sample. This uniform convergence over the sample set is essential when we apply the EMDF to analyze data objects in metric spaces, just as the corresponding result is for the analysis of data in a Euclidean space. The other important convergence property of the EMDF is convergence in distribution, called the Donsker property, which is similar to the central limit theorem.
Theorem 5 (The Convergence of Metric Distribution Process).
Let be a product space and be a probability measure on it. Define
If is a VC class with VC-dimension , then we have the Donsker property of the metric distribution process: converges in distribution to a Gaussian process , with zero mean and the covariance function:
For each , is the finite dimensional Banach space with the -norm .
is a sphere or a curve in the finite dimensional Euclidean space with the geodesic distance.
is a set of polygonal curves in with the Hausdorff distance or the Fréchet distance (Driemel et al., 2019).
4 Metric distribution function based statistical methods
In this section, we discuss how to use the MDF to conduct statistical inference in a few important and common problems.
4.1 Homogeneity test
A common and basic hypothesis testing problem in statistical inference is whether two samples are generated from the same distribution. Suppose we have data objects from two unknown Borel probability measures and on a metric space and need to check whether they are homogeneous, i.e., testing .
We introduce a Kolmogorov-Smirnov type homogeneity measure based on MDFs. Considering as the distribution function of for fixed and (similarly, for ), we use the Kolmogorov-Smirnov divergence to evaluate the distinction of and . Then, by integrating the Kolmogorov-Smirnov divergence for all , we define the metric Kolmogorov-Smirnov (MKS) divergence between and as
We symmetrize MKS divergence and define
Theorem 1 implies that if and only if .
We can estimate MKS with pairwise distances as follows. Let be a set consisting of , and be another set consisting of . and can be estimated on the basis of EMDF:
By Theorem 4, is a consistent estimator of . We suggest using for the two-sample test and approximating the $p$-value by permutation.
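The two-sample procedure can be sketched from a pooled distance matrix. The code below is a minimal illustration under stated assumptions: the supremum over radii is taken over balls whose radii reach pooled sample points, the integral over centers is replaced by an average over each sample, and the symmetrized statistic is the sum of the two one-sided divergences. The function names and the toy Gaussian data are hypothetical.

```python
import numpy as np

def mks_statistic(D, n):
    """Symmetrised metric Kolmogorov-Smirnov statistic (sketch).

    D is the pooled (n + m) x (n + m) distance matrix; the first n rows
    belong to sample X and the remaining m rows to sample Y.
    """
    D = np.asarray(D, dtype=float)
    # inside[u, v, t]: pooled point t lies in the closed ball centred at
    # pooled point u with radius d(u, v).
    inside = D[:, None, :] <= D[:, :, None]
    F_x = inside[:, :, :n].mean(axis=2)   # EMDF of sample X
    F_y = inside[:, :, n:].mean(axis=2)   # EMDF of sample Y
    ks = np.abs(F_x - F_y).max(axis=1)    # KS gap at each centre
    # Average over centres from X, then from Y, and add (assumed symmetrisation).
    return ks[:n].mean() + ks[n:].mean()

def permutation_pvalue(D, n, n_perm=199, seed=0):
    """Permutation p-value obtained by relabelling the pooled observations."""
    rng = np.random.default_rng(seed)
    observed = mks_statistic(D, n)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(D.shape[0])
        if mks_statistic(D[np.ix_(idx, idx)], n) >= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)

# Toy data: two real-valued samples with a clear location shift.
rng = np.random.default_rng(1)
z = np.concatenate([rng.normal(0.0, 1.0, 20), rng.normal(3.0, 1.0, 20)])
D = np.abs(z[:, None] - z[None, :])
stat, pval = permutation_pvalue(D, n=20)
print(stat, pval)
```

Because only `D` enters the computation, the same code applies to any metric space once the pooled pairwise distances are available.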
4.2 Independence test
Another fundamental problem in statistical inference is to test the mutual independence among several elements of a random object. Suppose is a vector of -tuple random objects () on a metric space , in which is associated with probability measure , and is associated with probability measure on for . The study of mutual independence is formulated as testing .
It is very convenient to utilize the MDF to measure the mutual dependence because of the definition of the MDF in product metric spaces. Following the Hoeffding dependence paradigm (Hoeffding, 1948), we integrate the difference between the joint MDF associated with and the product of marginal MDFs associated with ’s. We then obtain our metric association measure:
When , is the square of ball covariance in Pan et al. (2020). When the entries of are dependent, then Theorem 2 implies that . Suppose that are observations of associated with the Borel probability measure . The consistent estimator for is given by
The consistency of is guaranteed by Theorem 4. Under the null hypothesis, according to Chang and others (1990) and Theorem 5, the asymptotic distribution of is that of an integral of the square of a zero-mean Gaussian process. Then, by Kuo (1975), we can derive that asymptotically converges to a mixture distribution. This coincides with the results derived in Pan et al. (2020). To test mutual independence, we can approximate the $p$-value by permutation.
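A permutation version of the independence test can be sketched as follows. The sketch assumes a plug-in statistic that averages, over all sample pairs, the squared difference between the joint EMDF and the product of the marginal EMDFs; the exact weighting in the estimator above may differ. Function names and the toy data are hypothetical.

```python
import numpy as np

def metric_association(dist_mats):
    """Plug-in association statistic from K per-component distance matrices."""
    joint = None
    marg_prod = 1.0
    for D in dist_mats:
        D = np.asarray(D, dtype=float)
        # ball_k[i, j, t]: component k of X_t lies in the closed ball
        # centred at component k of X_i with radius d_k(X_i, X_j).
        ball_k = D[:, None, :] <= D[:, :, None]
        joint = ball_k if joint is None else joint & ball_k
        marg_prod = marg_prod * ball_k.mean(axis=2)   # product of marginal EMDFs
    joint_emdf = joint.mean(axis=2)                   # joint EMDF at (X_i, X_j)
    return float(np.mean((joint_emdf - marg_prod) ** 2))

def independence_pvalue(dist_mats, n_perm=199, seed=0):
    """Permutation p-value: shuffle components 2..K independently."""
    rng = np.random.default_rng(seed)
    observed = metric_association(dist_mats)
    n = np.asarray(dist_mats[0]).shape[0]
    count = 0
    for _ in range(n_perm):
        permuted = [dist_mats[0]]
        for D in dist_mats[1:]:
            idx = rng.permutation(n)
            permuted.append(D[np.ix_(idx, idx)])
        if metric_association(permuted) >= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)

# Toy data: y is a noisy copy of x, so the two components are dependent.
rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = x + 0.1 * rng.normal(size=30)
Dx = np.abs(x[:, None] - x[None, :])
Dy = np.abs(y[:, None] - y[None, :])
stat, pval = independence_pvalue([Dx, Dy])
print(stat, pval)
```

Shuffling each component separately breaks any dependence among components while keeping every marginal sample intact, which is what the null hypothesis of mutual independence requires.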
5 Monte Carlo Studies
We demonstrate how to use the MDF for analyzing data objects in metric spaces, specifically SPD matrix data and shape data. Both have intrinsically non-linear topological structures and are frequently encountered in modern statistical research. For instance, diffusion tensor images and brain connectomes are generally formulated as SPD matrices (Dryden et al., 2009; Smith et al., 2013), and DNA molecules and the corpus callosum are studied by statistical shape data analysis (Dryden and Mardia, 2016).
Here, we simulate data based on real connectome data containing 111 nodes and a corpus callosum characterized by 50 two-dimensional landmarks (see Figure 4). Mathematically, the connectome and the corpus callosum are formulated as a 111-by-111 partial correlation matrix and a 50-by-2 matrix, respectively (Smith et al., 2013; Huang et al., 2015). We first specify the method for generating random SPD matrices based on the connectome, which is denoted as a matrix . Suppose the Cholesky factorization of is matrix such that . We perturb the matrix by adding a sparse random matrix whose non-zero entries are random variables , and we generate a random SPD matrix for our simulation study.
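The SPD generation step can be sketched as below. This is an illustration under assumptions, not the authors' exact generator: the base matrix is a small stand-in for the 111-by-111 connectome, the perturbed entries are restricted to the strictly lower triangle so that the diagonal of the factor stays positive, and the non-zero entries are scaled standard normal draws. The name `random_spd_near` and the `scale` parameter are hypothetical.

```python
import numpy as np

def random_spd_near(base, n_nonzero=3, scale=0.1, rng=None):
    """Perturb the Cholesky factor of `base` at a few lower-triangular entries."""
    rng = np.random.default_rng() if rng is None else rng
    L = np.linalg.cholesky(base)                  # base = L @ L.T, L lower triangular
    rows, cols = np.tril_indices(base.shape[0], k=-1)   # strictly lower part (assumption)
    pick = rng.choice(rows.size, size=n_nonzero, replace=False)
    E = np.zeros_like(L)                          # sparse perturbation matrix
    E[rows[pick], cols[pick]] = scale * rng.standard_normal(n_nonzero)
    L_new = L + E                                 # still lower triangular, positive diagonal
    return L_new @ L_new.T                        # hence symmetric positive definite

# Stand-in base matrix (an equicorrelation matrix), not the real connectome.
base = 0.5 * np.eye(5) + 0.5 * np.ones((5, 5))
S = random_spd_near(base, rng=np.random.default_rng(0))
print(np.linalg.eigvalsh(S).min() > 0)
```

Perturbing the factor rather than the matrix itself guarantees positive definiteness without any projection step.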
Next, we simulate shape objects based on the corpus callosum in four steps. First, we approximate the curvilinear abscissa and ordinate of the closed outline of the corpus callosum with two Fourier series:
where is the perimeter of the closed outline of the corpus callosum and are constant harmonic coefficients with closed-form expressions. Such an approximation is known as the elliptic Fourier method (Kuhl and Giardina, 1982). Second, the harmonic coefficients are multiplied by a series of random variables , acting as perturbations, to obtain Fourier series with randomness:
Third, given and , we reconstruct a shape via the inverse elliptic Fourier transformation (Ferson et al., 1985). Finally, is appended with scale and rotation effects by multiplying a matrix to attain a shape object , where with , and .
We now introduce the distance measure. We utilize Cholesky distance (Dryden et al., 2009) to measure the difference between two SPD matrices , which is defined as , where is Frobenius norm and is the Cholesky decomposition. For the similarity measure between two shape objects , we use Riemannian shape distance in the classical statistical shape analysis (Dryden and Mardia, 2016). The definition of Riemannian shape distance is given by: , where is matrix trace operator, is the set of all orthogonal matrices, and is the preshape of () after removing the translation and scale of original shape object. To compare the performance of different choices, we also consider the affine-invariant Riemannian (AIR) distance for two SPD matrices (Dryden et al., 2009) and Euclidean distance for two shapes. The AIR distance is defined as in which is a matrix operator taking logarithm transformation on the diagonal elements, while the Euclidean distance of two shapes is defined as .
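The two main distances can be sketched numerically. The code assumes that the Cholesky distance is the Frobenius norm of the difference of lower Cholesky factors, and that the Riemannian shape distance is the arc cosine of the largest trace alignment between preshapes over all orthogonal matrices, which equals the sum of singular values of the cross-product of the preshapes. Function names and the toy triangle are hypothetical.

```python
import numpy as np

def cholesky_distance(S1, S2):
    """Frobenius norm of the difference between lower Cholesky factors."""
    L1, L2 = np.linalg.cholesky(S1), np.linalg.cholesky(S2)
    return float(np.linalg.norm(L1 - L2, ord="fro"))

def preshape(X):
    """Remove translation (centering) and scale (unit Frobenius norm)."""
    Xc = X - X.mean(axis=0, keepdims=True)
    return Xc / np.linalg.norm(Xc)

def riemannian_shape_distance(X1, X2):
    """arccos of the best alignment of preshapes over orthogonal matrices."""
    Z1, Z2 = preshape(X1), preshape(X2)
    s = np.linalg.svd(Z2.T @ Z1, compute_uv=False)
    return float(np.arccos(np.clip(s.sum(), -1.0, 1.0)))

# SPD example: the factors are diag(1, 1) and diag(2, 3).
S1, S2 = np.eye(2), np.diag([4.0, 9.0])
d_chol = cholesky_distance(S1, S2)

# Shape example: a triangle and a rotated, rescaled, shifted copy of it.
tri = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
tri_moved = 2.5 * tri @ R.T + np.array([1.0, -3.0])
d_shape = riemannian_shape_distance(tri, tri_moved)
d_euclid = float(np.linalg.norm(tri - tri_moved))
print(d_chol, d_shape, d_euclid)
```

In this toy case the shape distance is numerically zero while the Euclidean distance is large, which illustrates why a distance that ignores scale and rotation invariance can be misleading for shape data.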
We introduce the simulation settings for evaluating our tests of homogeneity and mutual independence. To this end, let and represent two independent random objects associated with two Borel probability distributions when testing homogeneity, and a tuple of random objects sampled from a Borel probability distribution when testing mutual independence. Let be a sparse random matrix with three non-zero entries whose indices are randomly selected from the lower triangular part of matrix. Let be the 5% quantile of the absolute value of . The settings for testing homogeneity are:
SPD matrix: and follows the Cauchy distribution;
Shape: are uniformly drawn from the interval , and the other parameters for generating random shapes are uniformly drawn from the interval . Let , .
And the settings for testing mutual independence are:
SPD matrix: , ;
Shape: are i.i.d. Bernoulli random variables, where with expectation , with expectation 0.5. , , , where , , and are uniformly drawn from the interval , and the other parameters for generating random shapes are uniformly drawn from the interval .
We study the empirical power of the proposed tests when: (i) the distribution distinction varies but is fixed, and (ii) the sample size increases but is fixed. We compare our proposed method with the energy distance (ED) (Székely and Rizzo, 2004) and the graph-based test (GBT) (Chen and Friedman, 2017) for the two-sample homogeneity test with . Specifically, for the GBT, the similarity graph of the pooled observations of the two groups is constructed as a 5-minimum spanning tree from the pooled pairwise distance matrix, following the suggestion of Chen and Friedman (2017). For testing mutual independence, we compare our proposed test with the total multivariance (TM) method (Böttcher et al., 2019). The significance level is fixed at 0.05. We use 399 permutation replications to derive the $p$-value, and 500 Monte Carlo runs to estimate the power. The results are presented in Figures 5 and 6.
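The power-estimation protocol (permutation $p$-value inside each Monte Carlo run, rejection at level 0.05) can be sketched generically. The harness below uses the energy-distance statistic as the example test because it has a simple closed form in terms of pairwise distances; any statistic with the same signature could be plugged in. All function names and the toy location-shift simulator are hypothetical, and the example call uses far fewer runs and permutations than the 500 and 399 described above so that it finishes quickly.

```python
import numpy as np

def energy_statistic(D, n):
    """Energy-distance statistic from a pooled distance matrix.

    The first n rows of D belong to sample 1; within-sample means include
    the zero diagonal (V-statistic form).
    """
    return 2.0 * D[:n, n:].mean() - D[:n, :n].mean() - D[n:, n:].mean()

def permutation_pvalue(D, n, statistic, n_perm, rng):
    """Permutation p-value by relabelling the pooled observations."""
    observed = statistic(D, n)
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(D.shape[0])
        hits += int(statistic(D[np.ix_(idx, idx)], n) >= observed)
    return (hits + 1) / (n_perm + 1)

def empirical_power(simulate, statistic, n_runs=500, n_perm=399, alpha=0.05, seed=0):
    """Fraction of Monte Carlo runs in which the test rejects at level alpha."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_runs):
        D, n = simulate(rng)
        rejections += int(permutation_pvalue(D, n, statistic, n_perm, rng) <= alpha)
    return rejections / n_runs

def simulate_shift(shift, n=15):
    """Toy simulator: two real-valued samples differing by a location shift."""
    def simulate(rng):
        z = np.concatenate([rng.normal(0.0, 1.0, n), rng.normal(shift, 1.0, n)])
        return np.abs(z[:, None] - z[None, :]), n
    return simulate

power_null = empirical_power(simulate_shift(0.0), energy_statistic, n_runs=40, n_perm=99)
power_alt = empirical_power(simulate_shift(3.0), energy_statistic, n_runs=40, n_perm=99)
print(power_null, power_alt)
```

Under the null setting the estimated power should stay near the nominal level, while under a clear shift it should approach 1; with only 40 runs the null estimate is noisy.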
From the upper panel of Figure 5.A, except when the Euclidean distance is used, the power of all three methods increases as the gap between the two distributions widens; in addition, their power approaches the significance level as the gap closes. Due to its robustness, MKS outperforms both ED and GBT when all of them use the same distance for the synthetic SPD matrices. On the other hand, GBT is superior to the others in detecting shape differences. Interestingly, all tests that use the Cholesky distance are better than those using the AIR distance in this simulation setting, and this is more obvious for testing the shape data. The distinction between shape objects is accurately characterized by the Riemannian distance, while the Euclidean distance is totally “fooled” since it does not take into account the scale-invariant and rotation-invariant properties of shape objects.
We can see from Figure 5.B that, for the shape objects, the power of the three tests improves as the sample size increases when the Riemannian distance is used. For the SPD matrix data, when the sample size increases, the power improves for both the ED and MKS tests; however, the power of GBT decreases when the Cholesky distance is used and the sample size exceeds 60. The empirical results suggest that MKS is a robust and powerful homogeneity test method.
Figure 6 displays the results of the mutual independence test. From Figure 6.A, the power functions of the two tests increase monotonically to 1 as the dependence strength increases. More precisely, MA is close to TM for shape datasets, and more powerful than TM for SPD matrix datasets. Notably, when random objects are mutually independent (), the empirical power of the two tests is around the nominal significance level. Again, the Cholesky distance leads to slightly better-powered tests, and the improvement of the Riemannian distance over the Euclidean distance is visible.
From Figure 6.B, the empirical power of the two tests increases as the sample size increases in both the SPD matrix and shape datasets, except when the Euclidean distance is used. In addition, MA is better than TM in the SPD matrix datasets, as well as in the shape datasets when . It is noteworthy that the simulation setting for the shape data is a typical example of pairwise independence with mutual dependence. Both TM and MA are powerful in identifying the mutual dependence of shape data.