Measuring spatial uniformity with the hypersphere chord length distribution

04/12/2020 ∙ by Panagiotis Sidiropoulos, et al. ∙ UCL

Data uniformity is a concept associated with several semantic data characteristics such as lack of features, correlation and sample bias. This article introduces a novel measure to assess data uniformity and detect uniform pointsets on high-dimensional Euclidean spaces. The spatial uniformity measure builds upon the isomorphism between hyperspherical chords and the Euclidean distances of L2-normalised data, which is implied by the fact that, in Euclidean spaces, L2-normalised data can be geometrically defined as points on a hypersphere. The imposed connection between the distance distribution of uniformly selected points and the hyperspherical chord length distribution is employed to quantify uniformity. More specifically, the closed-form expression of the hypersphere chord length distribution is revisited and extended, before examining a few qualitative and quantitative characteristics of this distribution that can be rather straightforwardly linked to data uniformity. The experimental section includes validation in four distinct setups, thus substantiating the potential of the new uniformity measure on practical data-science applications.


I Introduction

Uniformity is universally recognised across scientific domains and is used in a wide range of applications, since it is connected with several semantic data characteristics. Examples include but are not limited to (1) aggregating points in a multidimensional feature space (in which case uniformity suggests the lack of distinctive features), (2) concatenating uncorrelated Gaussian variables into a single vector (which may generate uniform points on a hypersphere through normalisation [1]) and (3) tuning multiple hyperparameters during algorithm evaluation, in cases where the number of hyperparameters prohibits the brute-force testing of all parameter combinations (so as to avoid over-representation or under-representation of regions of the hyperparameter space). On the other hand, uniformity is considered "common knowledge" and is rarely cited. The underlying intuitive definition of uniformity is as follows:

a pointset defined on a space is uniform iff it is the output of a stochastic process defined on that space in which all points of the space have equal probability to be generated.

The main issue with this definition is that it imposes a large (often infinite) number of probability equalities, which are both theoretically and practically challenging to fully confirm without a priori knowledge of the stochastic process. As a result, uniformity is most often confirmed through reductio ad absurdum reasoning; a set of reasonable non-uniform distributions is examined and disproven, thus implying uniformity as the only valid option. Perhaps the most typical approach is to aggregate all probability equalities into a small number of subset probability equations, based on the fact that only in a (spatially) uniform distribution does the probability ratio equal the size ratio, i.e. $P(S_i)/P(S_j) = |S_i|/|S_j|$, where $P(S_i)$ is the probability of a point being generated in a subset $S_i$ of the space, with corresponding size $|S_i|$. In particular, if the space is segmented into a family of non-overlapping equal-sized sets, the absolute frequency of all sets is expected to be equal iff the point distribution is uniform. In practice, this approach suffers from three main shortcomings: (a) the optimal number of subsets, as well as their boundaries, is not trivial to estimate, especially in cases where unmodelled symmetry properties may cause erroneous identification of uniformity; (b) segmenting sets becomes increasingly problematic in high-dimensional spaces due to the "curse of dimensionality" [2]; (c) the output of the assessment is a logical variable ("true" or "false"), while no quantitative evaluation is conducted.

A more direct approach to examine uniformity is through the use of spherical harmonics [3]. Spherical harmonics are a complete set of orthogonal functions on the hypersphere that model both uniformity and symmetry. A spatial distribution on a hypersphere that is dominated by the spherical harmonic of degree $0$ (which corresponds to the uniform part of the distribution) may be declared uniform. However, despite their elegant and mathematically rigorous modelling of uniformity, the generalisation of spherical harmonics to higher dimensions greatly expands the number of spherical harmonics, even those of low degree. As a matter of fact, the number of spherical harmonics of degree $\ell$ in $N$ dimensions is $\binom{N+\ell-1}{N-1} - \binom{N+\ell-3}{N-1}$ [4]. Hence, the number of spherical harmonics of degree $\ell$ is linear in $\ell$ in $3$-dimensional space, quadratic in $4$-dimensional spaces, cubic in $5$-dimensional spaces, etc. This makes the use of spherical harmonics impractical even for small dimensions.
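The growth of the number of spherical harmonics with the dimension can be checked numerically. The following is a minimal sketch, assuming the standard counting formula for harmonic homogeneous polynomials; the function name `num_harmonics` is hypothetical and not taken from the paper.

```python
from math import comb

def num_harmonics(degree: int, dim: int) -> int:
    """Number of linearly independent spherical harmonics of the given degree
    on the sphere embedded in `dim`-dimensional space (assumes dim >= 3)."""
    if dim < 3 or degree < 0:
        raise ValueError("this sketch assumes dim >= 3 and degree >= 0")
    # math.comb(n, k) returns 0 when k > n, which covers degrees 0 and 1.
    return comb(dim + degree - 1, dim - 1) - comb(dim + degree - 3, dim - 1)

# Counts for degrees 0..5: linear growth for dim=3, quadratic for dim=4, etc.
for dim in (3, 4, 5, 10):
    print(dim, [num_harmonics(degree, dim) for degree in range(6)])
```

For `dim=3` the counts follow `2 * degree + 1`, and for `dim=4` they follow `(degree + 1) ** 2`, which matches the linear and quadratic growth described above.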

The foundation of the present work is a novel uniformity definition, one that is equivalent to the "classical" one but can lead to additional tools to examine uniformity: a pointset defined on a space is uniform iff it is the output of a stochastic process in which the limit set of generated points includes an equal number of all points of the space. In the above statement, the phrase "limit set of generated points" refers to a set that contains infinitely more points than the space itself. The novelty of this definition is that it is based on the absolute frequency of the generated points and not on their probability, as the classical uniformity definition is. The two are equivalent because the limit at infinity of the (normalised) absolute frequency is the probability.

The main gain is that the new definition implies a connection of the uniform distribution with the chords connecting points of the space. More specifically, since the limit uniform set includes an equal number of all points, the limit distribution of point distances is the distribution of the chord lengths of the space. If the space is equipped with a metric, then its chord length distribution can be examined and formalised, thus modelling the distribution that uniform point distances follow. Subsequently, the similarity of pointset distance distributions with the theoretical chord length distribution can be used to qualitatively and quantitatively assess uniformity.

This work presents such an analysis for the special case in which the space is a hypersphere of dimension $N$ and the metric is the Euclidean distance. This case is very useful from a practical point of view because it corresponds to points normalised to have a fixed Euclidean norm (usually equal to $1$), a data structure that finds extensive applications in data science. Apart from the novel uniformity definition, the main novelties of this work are:

  • The closed-form expression and the basic properties of the hypersphere chord length distribution and a corresponding analysis for the hyper-hemisphere chord length distribution

  • The introduction of the basic principles of measuring uniformity using the hypersphere chord length distribution, including a preliminary experimental evaluation on both real and synthetic data

  • The introduction of the basic principles of detecting uniform hyperspherical subsets in high-dimensional data, including a preliminary experimental evaluation

The rest of this work is structured as follows. The related work on estimating closed-form expressions of chord length distributions is summarised in Section II, while the hypersphere and hyper-hemisphere chord length distributions are presented and thoroughly examined in Section III. The theoretical analysis of how these can be used to assess uniformity and detect uniform subsets is conducted in Sections IV and V, respectively, while the related experimental evaluation follows in Section VI. Section VII concludes this article.

II Chord Length Distributions

The study of chord length distributions is part of stochastic geometry, a domain that was historically a sparse set of intuitive mathematical puzzles (such as Buffon's clean tile and needle problems [5], [6]) and which has recently advanced significantly both theoretically and practically [7], the latter including applications in image analysis (e.g. [8]), computer vision (e.g. [9]), etc. Within stochastic geometry, the chord length is defined as a random variable; more specifically, the random variable that is equal to the distance between two points randomly (i.e. uniformly) selected from a space. The chord length distribution models this random variable and, as already mentioned, it is also the limit distribution of the inner distances of a uniformly selected set on that space.

Despite this association, little work has been done on estimating closed-form expressions of chord length distributions. Currently, this challenging problem has found solutions only in very specific cases, usually related to the $2$-dimensional or $3$-dimensional spaces used in radiation research [10]. Examples of shapes for which the chord length distribution is known are a regular polygon [11], a parallelogram [12], a cube [13], a hemisphere [14], etc.

The literature on closed-form expressions of chord length distributions in high-dimensional spaces is even sparser, including the chord length distribution of points inside a hypersphere [15] (which is different from the distribution on a hypersphere that is presented here), in two adjacent unit squares [16], the chord length distribution of N-dimensional points whose variables follow a Gaussian distribution [17], and an analysis regarding specifically the average chord length in a compact convex subset of an n-dimensional Euclidean space [18].

Characteristic of the limited interest in high-dimensional chord length distributions is the fact that, while J. M. Hammersley introduced the chord length distribution of points selected within a hypersphere in 1950, the corresponding chord length distribution for points selected on a hypersphere only became available decades later in a preliminary self-printed version of this work [19], based on the recently estimated closed-form expression of the surface of a hyperspherical cap as a fraction of the total hypersphere surface [20]. In the present version, the hypersphere chord length distribution estimation is repeated in a more compact presentation, augmented by the corresponding analysis for hyper-hemispheres. Moreover, the introduced distributions are not merely presented as mathematical achievements but are subsequently employed in a novel approach that both quantitatively and qualitatively assesses spatial uniformity.

III Chord length distributions on the hypersphere

III-A Hypersphere chord length distribution

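Let $p_1, p_2, \dots, p_M$ be points selected uniformly and independently from the surface of an $N$-dimensional hypersphere of radius $R$, i.e. $\sum_{k=1}^{N} x_k^2 = R^2$ for the coordinates $(x_1, \dots, x_N)$ of each point. The pairwise Euclidean distances $d_{ij} = \|p_i - p_j\|$, $i < j$, generate a set of $K$ distances ($K = M(M-1)/2$). The hypersphere (or N-sphere) chord length distribution is the distribution of $d_{ij}$ as $K$ (i.e. $M$) tends to infinity.

If $N = 2$, then the N-sphere is a circle. The circle chord length distribution is a special case, for which both the pdf ($f_2(d)$) and the cdf ($P_2(d)$) can be found in the literature (e.g. [21]):

$$f_2(d) = \frac{1}{\pi}\,\frac{1}{\sqrt{R^2 - d^2/4}}, \quad 0 \le d \le 2R \qquad (1)$$

$$P_2(d) = \frac{2}{\pi}\arcsin\left(\frac{d}{2R}\right), \quad 0 \le d \le 2R \qquad (2)$$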
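The circle case can be checked empirically. The sketch below draws uniform points by normalising Gaussian vectors (the construction mentioned in the Introduction), computes all pairwise distances and compares the empirical cdf with the arcsine form of the circle chord length cdf. The function names, the sample size of 400 and the seed are illustrative assumptions.

```python
import math
import random

def sample_on_sphere(dim, radius=1.0, rng=random):
    """One point drawn uniformly from the sphere of the given radius in
    dim-dimensional space, by normalising i.i.d. standard Gaussians."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:
            return [radius * x / norm for x in v]

def pairwise_distances(points):
    """All M(M-1)/2 Euclidean distances between distinct points."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

def circle_chord_cdf(d, radius=1.0):
    """Circle chord length cdf: (2/pi) * arcsin(d / 2R)."""
    return (2.0 / math.pi) * math.asin(min(d / (2.0 * radius), 1.0))

rng = random.Random(0)
points = [sample_on_sphere(2, rng=rng) for _ in range(400)]
dists = pairwise_distances(points)
for d in (0.5, 1.0, math.sqrt(2.0), 1.8):
    empirical = sum(x <= d for x in dists) / len(dists)
    print(f"d={d:.3f} empirical={empirical:.3f} theory={circle_chord_cdf(d):.3f}")
```

The empirical and theoretical columns should agree closely, with the agreement improving as the number of points grows.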

The estimation of the closed-form expressions for the pdf and the cdf in the general case (i.e. $f_N(d)$ and $P_N(d)$, respectively) is assisted by the hypersphere homogeneity, i.e. the fact that the hypersphere (and its chord length distribution) is invariant to axis rotation. Therefore, the hypersphere chord length distribution can be estimated assuming that one chord end is fixed to a point $p$, here taken to be the pole $p = (0, \dots, 0, R)$, while the other end determines the chord length. An additional consequence of the rotation invariance is that $f_N$ ($P_N$) is not only the asymptotic pdf (cdf) of the pairwise distances but also the asymptotic pdf (cdf) of the distances from any fixed point in the point set, i.e. when the number of points tends to infinity each row (and column) of the distance matrix follows the $f_N$ distribution.

Assuming that one of the end points of the chord is $p$, the other end points of the chords of length $d$ lie on an $(N-1)$-sphere of radius $\rho = d\sqrt{1 - d^2/(4R^2)}$. This is derived by eliminating $x_N$ from the N-sphere equation ($\sum_{k=1}^{N} x_k^2 = R^2$) and the distance-from-$p$ equation ($\sum_{k=1}^{N-1} x_k^2 + (x_N - R)^2 = d^2$). The $(N-1)$-sphere is the intersection of the $N$-sphere with the hyperplane $x_N = R - d^2/(2R)$. Since $R - d'^2/(2R)$ decreases as $d'$ increases, all points of the N-sphere with distance $d' < d$ from $p$ satisfy $x_N > R - d^2/(2R)$, and all points of the N-sphere with distance $d' > d$ from $p$ satisfy $x_N < R - d^2/(2R)$. Therefore, the hyperplane cuts the hypersphere into two parts, each defined by the comparison of the chord length with $d$. A hyperspherical cap is, by definition, a part of a hypersphere cut off by a hyperplane; hence, these parts are hyperspherical caps, i.e.

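Proposition III.1.

The locus of the $N$-sphere points that have distance $d' \le d$ from a point $p$ on it is a hyperspherical cap of radius $\rho = d\sqrt{1 - d^2/(4R^2)}$.

Proposition III.1 implies that the cdf $P_N(d)$ is given as the ratio of a hyperspherical cap surface to the hypersphere surface. Before estimating $P_N(d)$, recall that for each N-sphere point $q$ with $\|q - p\| = d$ there is a point $q'$ (the antipode of $q$) for which $\|q' - p\| = \sqrt{4R^2 - d^2}$, and vice versa. As a result:

$$P_N(d) = 1 - P_N\left(\sqrt{4R^2 - d^2}\right) \qquad (3)$$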

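Due to Eq. (3), only $P_N(d)$ for $d \le \sqrt{2}R$ (i.e. corresponding to hyperspherical caps no larger than a hyper-hemisphere) is required. This part of the cdf is estimated using the surface of a hyperspherical cap that is smaller than a hyper-hemisphere [20]:

$$A_N^{cap}(R) = \frac{1}{2} A_N(R)\, I_{\sin^2\phi}\left(\frac{N-1}{2}, \frac{1}{2}\right) \qquad (4)$$

In Eq. (4), $N$ is the hypersphere dimension, $R$ its radius, $A_N(R)$ the hypersphere surface, $\phi$ the colatitude angle [20] and $I_x(a, b)$ the regularised incomplete beta function [22] given by

$$I_x(a, b) = \frac{\int_0^x t^{a-1}(1-t)^{b-1}\,dt}{\int_0^1 t^{a-1}(1-t)^{b-1}\,dt} \qquad (5)$$

In order to eliminate the colatitude angle from Eq. (4), we use the fact that $\cos\phi = (R - h)/R$, where $h$ is the cap height. Since the maximum distance $d$, the cap radius $\rho$ and the cap height $h$ form a right triangle (Fig. 1), the height of the cap is $h = d^2/(2R)$. Therefore, $\sin^2\phi = d^2/R^2 - d^4/(4R^4)$ and the cdf is as follows: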

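Proposition III.2.

The cumulative distribution function of the $N$-sphere chord length $d$ is

$$P_N(d) = \begin{cases} \frac{1}{2}\, I_{\frac{d^2}{R^2} - \frac{d^4}{4R^4}}\left(\frac{N-1}{2}, \frac{1}{2}\right), & 0 \le d \le \sqrt{2}R \\ 1 - \frac{1}{2}\, I_{\frac{d^2}{R^2} - \frac{d^4}{4R^4}}\left(\frac{N-1}{2}, \frac{1}{2}\right), & \sqrt{2}R < d \le 2R \end{cases} \qquad (6)$$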
Fig. 1: A hyperspherical cap and the relation of the maximum distance $d$ from a point $p$, the hyperspherical cap height $h$ and its radius $\rho$.

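The corresponding pdf is:

Proposition III.3.

The probability density function of the $N$-sphere chord length $d$ is:

$$f_N(d) = \frac{d}{R^2\, B\left(\frac{N-1}{2}, \frac{1}{2}\right)} \left(\frac{d^2}{R^2} - \frac{d^4}{4R^4}\right)^{\frac{N-3}{2}}, \quad 0 \le d \le 2R \qquad (7)$$

where $B(a, b) = \int_0^1 t^{a-1}(1-t)^{b-1}\,dt$ is the beta function.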
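The pdf and cdf of the N-sphere chord length can be evaluated with the standard library alone. The sketch below is an illustration under stated assumptions, not the paper's implementation: the pdf uses the beta-function form, while the cdf is computed as the fraction of the sphere surface within colatitude $2\arcsin(d/2R)$ of a fixed pole, integrated numerically with Simpson's rule instead of calling an incomplete beta routine. The function names and the step count are hypothetical.

```python
import math

def chord_pdf(d, dim, radius=1.0):
    """Chord length pdf on the sphere in dim-dimensional space (dim >= 2)."""
    if not 0.0 < d < 2.0 * radius:
        return 0.0
    a = (dim - 1) / 2.0
    beta = math.gamma(a) * math.gamma(0.5) / math.gamma(a + 0.5)
    x = (d / radius) ** 2 - (d / radius) ** 4 / 4.0
    return d / (radius ** 2 * beta) * x ** ((dim - 3) / 2.0)

def chord_cdf(d, dim, radius=1.0, steps=2000):
    """P(chord <= d): ratio of the integrals of sin(t)^(dim-2) over
    [0, theta] and [0, pi], with theta = 2*asin(d / 2R). `steps` must be even."""
    d = min(max(d, 0.0), 2.0 * radius)
    theta = 2.0 * math.asin(d / (2.0 * radius))

    def simpson(upper):
        h = upper / steps
        total = 0.0
        for i in range(steps + 1):
            weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
            total += weight * math.sin(i * h) ** (dim - 2)
        return total * h / 3.0

    return simpson(theta) / simpson(math.pi)

for dim in (2, 3, 4, 6):
    print(dim, round(chord_pdf(1.0, dim), 4), round(chord_cdf(1.0, dim), 4))
```

For `dim=3` the cdf reduces to `d**2 / (4 * R**2)` and the pdf to `d / (2 * R**2)`, which gives a quick sanity check of both functions.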

III-A1 Basic properties of the hypersphere chord length distribution

Table I summarises the chord length distributions of hyperspheres of dimension $2$ to $6$, while the cumulative distribution functions and the probability density functions for six values of $N$ are shown in Figs. 2 and 3, respectively.

N pdf cdf Mean Median Variance
2 R
3 R
4 R
5 R
6 R
TABLE I: Basic properties of the N-sphere chord length distribution for $N = 2$ to $6$.
Fig. 2: The cumulative distribution functions, panels (a) to (f).

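The moments about the origin, $\mu'_k = E[d^k]$, are estimated by using the transform $t = d^2/(4R^2)$, which leads to the following equation:

$$\mu'_k = (2R)^k\, 2^{N-2}\, \frac{B\left(\frac{N+k-1}{2}, \frac{N-1}{2}\right)}{B\left(\frac{N-1}{2}, \frac{1}{2}\right)} \qquad (8)$$

where $B(\cdot, \cdot)$ is the beta function. Hence, for the mean, the following holds:

Proposition III.4.

The mean value of the $N$-sphere chord length distribution is

$$\mu'_1 = 2^{N-1} R\, \frac{B\left(\frac{N}{2}, \frac{N-1}{2}\right)}{B\left(\frac{N-1}{2}, \frac{1}{2}\right)} \qquad (9)$$

On the other hand, $\mu'_2$ can be proven to be independent of the hypersphere dimension $N$. Indeed, Eq. (8) for $k = 2$ becomes:

$$\mu'_2 = 2^N R^2\, \frac{\Gamma\left(\frac{N}{2}\right)\Gamma\left(\frac{N+1}{2}\right)}{\sqrt{\pi}\,\Gamma(N)} \qquad (10)$$

where $\Gamma$ is the Gamma function. Using the following Gamma function property [23]:

$$\Gamma(z)\,\Gamma\left(z + \tfrac{1}{2}\right) = 2^{1-2z}\sqrt{\pi}\,\Gamma(2z) \qquad (11)$$

and substituting in Eq. (10) it follows that $\mu'_2 = 2R^2$. If a point distribution in space is considered as a "spatial stochastic signal", then $\mu'_2$ would correspond to the signal power. The independence of $\mu'_2$ from the hypersphere dimension signifies that the "power" of the uniform distribution on a hypersphere is constant in all hyperspheres of equal radius, independently of their dimension.

The variance is straightforwardly estimated from $\mu'_1$ and $\mu'_2$:

$$\sigma^2 = \mu'_2 - (\mu'_1)^2 = 2R^2 - (\mu'_1)^2 \qquad (12)$$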
Fig. 3: The probability density functions, panels (a) to (f).

Apart from $\mu'_2$, the median is also independent of the dimension. By substituting $d = \sqrt{2}R$ into Eq. (3) it follows that $P_N(\sqrt{2}R) = 1/2$. $\sqrt{2}R$ is the distance between a "pole" and the "equator", thus this property is intuitively expected, since it follows from the fact that the two hyper-hemispheres have an equal number of points.
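The moment properties discussed above (a mean that depends on the dimension, a second moment equal to $2R^2$ in every dimension) can be checked numerically. The sketch below assumes the angle parametrisation chord $= 2R\sin(\theta/2)$ with density proportional to $\sin^{N-2}\theta$; the closed form used in `chord_mean` was derived from that parametrisation and may differ in presentation from the paper's expression. All function names are hypothetical.

```python
import math

def chord_mean(dim, radius=1.0):
    """Closed-form mean chord length (assumed form, derived from the
    angle parametrisation): 2^(N-1) R Gamma(N/2)^2 / (sqrt(pi) Gamma(N - 1/2))."""
    return (2 ** (dim - 1) * radius * math.gamma(dim / 2) ** 2
            / (math.sqrt(math.pi) * math.gamma(dim - 0.5)))

def chord_variance(dim, radius=1.0):
    """Variance as second moment (2 R^2) minus the squared mean."""
    return 2.0 * radius ** 2 - chord_mean(dim, radius) ** 2

def chord_moment(k, dim, radius=1.0, steps=2000):
    """k-th raw moment by Simpson integration over the angle theta in [0, pi]."""
    h = math.pi / steps
    num = den = 0.0
    for i in range(steps + 1):
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        t = i * h
        density = weight * math.sin(t) ** (dim - 2)
        num += density * (2.0 * radius * math.sin(t / 2.0)) ** k
        den += density
    return num / den

for dim in range(2, 7):
    print(dim, round(chord_mean(dim), 4), round(chord_moment(2, dim), 4),
          round(chord_variance(dim), 4))
```

The printed second moment stays at 2 for every dimension, while the mean grows towards $\sqrt{2}R$ and the variance shrinks.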

Finally, a secondary contribution of the hypersphere distribution is that it allows to estimate the generic solution of the Bertrand problem [24], which refers to the probability of a random chord being larger than the radius . By substituting in Eq. (6) we get that:

(13)

is independent from the radius and rapidly decreasing with respect to the dimension . for is , , and , respectively.

III-B Hyper-hemisphere chord length distribution

Apart from the chord length distribution of the whole hypersphere it would be useful to estimate the corresponding distribution of hypersphere sectors, starting with the hyper-hemisphere one. Without loss of generality it can be assumed that the hyper-hemisphere is the part of the hypersphere for which . The ”pole” or, formally speaking, the Chebyshev centre [25] of the hyper-hemisphere, i.e. the point that has the minimum maximum distance, is the point . The existence of a unique Chebyshev centre (contrary to the hypersphere for which every point has equal maximum distance) implies that points in a hyper-hemisphere are not homogeneous. Therefore, when the number of points tends to infinity, the rows (and columns) of the distance matrix will not follow the same distribution.

However, the hyper-hemisphere is invariant to rotations around the axis, i.e. all points on the surface of the hyper-hemisphere with equal are produced by the rotation of the point () around axis. Since point distance is invariant to rotation, if , where and is the distance from point and , respectively and , are the respective chord length distributions. As a result, the probability that a hyper-hemispherical chord is smaller than () is:

(14)

where is the distance from the point . A point in the hyper-hemisphere has iff it belongs to a hyperspherical cap centred on the hyper-hemisphere pole with colatitude angle . By Eq. (4) it follows that:

(15)

and, finally, that:

(16)

On the other hand, a hyper-hemispherical chord with one end in has a length less than or equal to iff it belongs to a corresponding hyperspherical cap of centre and maximum distance . Therefore, , where is the percentage of the hyperspherical cap of centre that lies within the hyper-hemisphere, is the total surface of the hyperspherical cap and is the total surface of the hyper-hemisphere. Since , Eq. (14) becomes:

(17)

where .

Note that the hyperspherical cap of centre , , is a rotated version of a same-size hyperspherical cap having as a centre the pole , . The rotation is on the plane that is defined by the centre of the sphere , the pole and the chord end , i.e. the plane defined by and , and the rotation angle is the angle between and , which in this case is .

is determined by the coordinate of , which is determined by the and coordinates of . Even though this seems as a -dimensional geometrical problem, it is more complex than that because and are correlated with the rest of the coordinates through the hypersphere equation. Still, is the percentage of points for which .

A first remark is that if then , because and . The inequality holds for half of points because the (N-1)-coordinate of the pole is and is an axis of symmetry of . Therefore, . Moreover, the integral of is because is a pdf. By substitution to Eq. (17) we confirm the following intuitive proposition.

Proposition III.5.

The hyper-hemisphere cdf is larger than the hypersphere cdf , i.e.

As a matter of fact, equals if the rotation angle is sufficiently small. To estimate the range of for which , recall that the part of that lies within the hyper-hemisphere is the cut of the hypersphere with two hyperplanes, and . The cap lies entirely within the hyper-hemisphere (i.e. ) iff the hyperplane intersection happens outside the hypersphere. This implies that:

(18)

The integral , by substituting , becomes . Therefore,

(19)

where is Eq. (17) with the upper integral limit changed according to Eq. (18) to .

The determination of in the case that (Fig. 4) is a rather challenging problem, which however can be linked to a single surface ratio:

(20)

where is the area of the locus for which , (Fig. 4).

Fig. 4: as the ratio of the shaded area in relation to the hyperspherical cap with maximum distance .

The area of can be estimated using the intersection of two hyperspherical caps, which has been recently examined in detail [26]. Using the taxonomy of [26], this corresponds to case No. , i.e. with axis angle less than , and the two hyperspherical cap angles and . According to [26], the hyperspherical cap part that does not intersect with the hyper-hemisphere is as follows:

(21)

where and . Estimating through (Eq. 21) and replacing to (Eq. 19) gives the generic formula of hyper-hemisphere chord length distribution. This is a rather challenging task and leads to complex and lengthy expressions even for small values of . As an example, the hyper-hemisphere chord length cdf for is given:

Proposition III.6.

If , the probability that a hyper-hemisphere chord is less or equal than , is , where:

(22)
(23)
(24)

Analogous closed-form expressions of the hyper-hemisphere chord length distribution can be estimated using (Eq. 21) and (Eq. 19) if required.

The cdf estimation is completed for by revisiting the hyperspherical symmetry of Eq. (3) and taking into account that and belong to different hyper-hemispheres. Therefore, for each pair of points that belong to the same hyper-hemisphere and have a distance there is a pair of points that belong to different hyper-hemispheres and have a distance and vice versa. This property defines an equation between the cdf of a hypersphere and the cdf of a hyper-hemisphere, which leads to the following property for the hyper-hemispherical cdf for :

Proposition III.7.

The probability that a hyper-hemisphere chord is less or equal than , is , where and are the hyper-spherical and hyper-hemispherical cdfs for , respectively.

III-B1 Basic properties of the hyper-hemisphere chord length distribution

Following the above analysis one can estimate both the cdf and the pdf of the hyper-hemisphere chord length distribution for any . However, it is apparent that the compactness of the (whole-)hypersphere case, i.e. the hyperspherical chord length distribution, is lost, thus implying that estimating analytical formulas of the chord length distribution of widely used hyperspherical segments and/or sectors is expected to be a rather challenging task.

Another difference between the hypersphere and hyper-hemisphere distribution is the fact that for the cdf is no longer independent from the hypersphere dimension (let alone equal to ). For example, by substituting in the equations of proposition III.6 it follows that , , i.e. . In this case, the divergence between the hyper-hemisphere and the hypersphere cdf value for , , is:

(25)

Finally, the independence of the second moment from the dimension does not hold for the hyper-hemisphere. However, the gradual decrease of the variance with dimension is also apparent in the hyper-hemisphere case, as shown in Table II. Table II summarises the basic properties of all hyper-hemisphere chord length distributions for dimensions 3 to 6.

N Mean Median Variance
3
4
5
6
TABLE II: Basic properties of the hyper-hemisphere chord length distribution for dimension 3 to 6.

IV Hypersphere chord length distribution as a uniformity measure

As already mentioned, some of the most interesting properties of the hypersphere chord length distribution arise from the fact that this is the limit distribution of the distances of uniformly selected hypersphere points. To summarise, the hypersphere chord length distribution is the limit distribution of three (related but distinct) distributions:

  1. The distance distribution of point-pairs , if the relevant points and are independently selected from a uniform random distribution.

  2. The intra-distance distribution of a set of points , if the relevant points are independently selected from a uniform random distribution, before the pairwise distances are estimated.

  3. The distance distribution of a set of points from a fixed point if both and are selected from a uniform random distribution before the pairwise distances are estimated.

The second and the third distribution allow the hypersphere chord length distribution to be used as a uniformity measure, as is described in the current section.

More specifically, in order to quantify the "uniformity" of an input point distribution on an N-sphere, the $L_1$ distance is used. If the intra-distance distribution of the input point distribution is $g(d)$, then the uniformity measure is defined as follows:

$$L_1(g, f_N) = \int_0^{2R} \left| g(d) - f_N(d) \right| \, \mathrm{d}d \qquad (26)$$

where $f_N$ is the hypersphere chord length distribution. Note that this uniformity measure can be used to quantify the uniformity of all types of point distance distributions that are described above.
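A histogram-based estimate of this measure can be sketched as follows. The binning scheme (40 equal-width bins on $[0, 2R]$), the function names and the synthetic "clustered" sample are all assumptions made for illustration; the paper does not prescribe a discretisation here. The theoretical bin probabilities are obtained from the chord length cdf, computed by numerical integration over the angle between the chord ends.

```python
import math
import random

def chord_cdf(d, dim, radius=1.0, steps=200):
    """P(chord <= d) on the sphere in dim-dimensional space (Simpson's rule)."""
    def simpson(upper):
        h = upper / steps
        return h / 3.0 * sum(
            (1 if i in (0, steps) else 4 if i % 2 else 2) * math.sin(i * h) ** (dim - 2)
            for i in range(steps + 1))
    theta = 2.0 * math.asin(min(max(d / (2.0 * radius), 0.0), 1.0))
    return simpson(theta) / simpson(math.pi)

def sample_sphere(n, dim, rng, shift=0.0, radius=1.0):
    """n normalised Gaussian vectors; shift=0 gives uniform points, while a
    positive shift on the first coordinate clusters them around one pole."""
    points = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v[0] += shift
        norm = math.sqrt(sum(x * x for x in v))
        points.append([radius * x / norm for x in v])
    return points

def l1_uniformity(points, radius=1.0, bins=40):
    """Binned L1 distance between the empirical intra-distance distribution
    and the theoretical chord length distribution (0 = perfect match)."""
    dim = len(points[0])
    cdf = [chord_cdf(2.0 * radius * i / bins, dim, radius) for i in range(bins + 1)]
    counts = [0] * bins
    for i, p in enumerate(points):
        for q in points[i + 1:]:
            idx = min(int(math.dist(p, q) / (2.0 * radius) * bins), bins - 1)
            counts[idx] += 1
    n_pairs = len(points) * (len(points) - 1) // 2
    return sum(abs(counts[i] / n_pairs - (cdf[i + 1] - cdf[i])) for i in range(bins))

rng = random.Random(1)
uniform = sample_sphere(200, 4, rng)
clustered = sample_sphere(200, 4, rng, shift=3.0)
print("uniform:", l1_uniformity(uniform), "clustered:", l1_uniformity(clustered))
```

The binned value is bounded by 2 and approximates the integral form from below, so a finer binning is needed when the sample is large enough to support it.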

The reason for selecting $L_1$ is twofold; firstly, it satisfies the metric conditions, thus defining a metric space; secondly, it was experimentally found that the convergence rate to for a uniform distribution is , where is the number of point-pairs that are included in the distribution (, and , respectively, for the three examined types of point distance distributions). This allows the experimental computation of "confidence intervals" for $L_1$ even when takes an impractically large value. Initially, uniform pointsets of size () are generated on the N-sphere and the values are sorted, before acquiring the -largest value and finally extrapolating for pointsets of size . This value is the threshold with which the input distance distribution is compared to determine whether it is uniform or not.

Elaborating on this idea, and depending on the computational cost of iteratively estimating pairwise distances, $L_1$ can be used to qualitatively assess whether an -dimensional point sample (consisting of points, and having an intra-distribution ) originates from a uniform N-sphere (or N-hemisphere) distribution, following one of the three approaches below:

  • If dimension and sample size imply a non-prohibitive computational cost, then uniformly distributed point sets of size and dimension are randomly generated and is estimated for all of them (as well as for ). If the -largest value of is smaller than then can be declared as non-uniform with confidence .

  • If dimension and sample size imply a prohibitive computational cost for estimating for uniformly distributed point sets () but not for estimating , then the difference with the previous case is that the -largest value of is estimated using sets of -dimensional points () and extrapolating for .

  • If the computational cost needs to be further reduced then a point of is fixed and the distributions of the distances from this point are estimated. These distributions are still expected to have as a limit distribution the hypersphere chord length distribution, while the associated computational complexity is linear (instead of quadratic).
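The approaches listed above can be sketched in code for the cheapest variant, in which distances are measured from a single fixed point so that the cost is linear in the sample size. This is a rough illustration under several assumptions: the threshold is the k-th largest score among simulated uniform sets of the same size (no extrapolation step is attempted), 20 histogram bins are used, and reading the outcome as a confidence of roughly `1 - k / trials` is an interpretation, not a statement from the paper. All names are hypothetical.

```python
import math
import random

def chord_cdf(d, dim, radius=1.0, steps=200):
    """P(chord <= d) on the sphere in dim-dimensional space (Simpson's rule)."""
    def simpson(upper):
        h = upper / steps
        return h / 3.0 * sum(
            (1 if i in (0, steps) else 4 if i % 2 else 2) * math.sin(i * h) ** (dim - 2)
            for i in range(steps + 1))
    theta = 2.0 * math.asin(min(max(d / (2.0 * radius), 0.0), 1.0))
    return simpson(theta) / simpson(math.pi)

def sample_sphere(n, dim, rng, shift=0.0):
    """n normalised Gaussian vectors on the unit sphere; shift > 0 clusters them."""
    points = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v[0] += shift
        norm = math.sqrt(sum(x * x for x in v))
        points.append([x / norm for x in v])
    return points

def l1_from_fixed_point(points, theory):
    """Binned L1 distance between the distances from points[0] to all other
    points and the theoretical bin probabilities `theory` (unit radius)."""
    bins = len(theory)
    counts = [0] * bins
    for q in points[1:]:
        idx = min(int(math.dist(points[0], q) / 2.0 * bins), bins - 1)
        counts[idx] += 1
    n = len(points) - 1
    return sum(abs(counts[i] / n - theory[i]) for i in range(bins))

def mc_threshold(n_points, dim, theory, rng, trials=100, k=5):
    """k-th largest score among `trials` simulated uniform point sets."""
    scores = sorted((l1_from_fixed_point(sample_sphere(n_points, dim, rng), theory)
                     for _ in range(trials)), reverse=True)
    return scores[k - 1]

dim, bins = 4, 20
cdf = [chord_cdf(2.0 * i / bins, dim) for i in range(bins + 1)]
theory = [cdf[i + 1] - cdf[i] for i in range(bins)]
rng = random.Random(7)
threshold = mc_threshold(300, dim, theory, rng)
score = l1_from_fixed_point(sample_sphere(300, dim, rng, shift=3.0), theory)
print("threshold:", threshold, "score:", score, "non-uniform:", score > threshold)
```

A sample whose score exceeds the threshold is flagged as non-uniform; a score below it only means that non-uniformity was not detected at this sample size.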

The above tests are designed to identify non-uniform spatial distributions, thus they cannot securely confirm uniformity. In practice, this is rarely expected to be of major importance because the uniformity-measurement objective is usually to assess whether the points span the hypersphere in a way that is compatible with the uniform distribution, not to mathematically confirm that they actually originate from a "pure" uniform hypersphere distribution. For example, if an algorithm has a large number of hyperparameters and its exhaustive evaluation in the hyperparameter space is computationally expensive, a straightforward approach would be to sub-sample the hyperparameter space "uniformly", selecting a small number of points (i.e. hyperparameter combinations). In this case, uniformity of the set of hyperparameter-points is not a strict theoretical requirement but a rather loose condition to ensure that the evaluation does not omit a large neighbourhood in the hyperparameter space. The above tests would be sufficient to assess whether the hyperparameter-points were "uniformly" selected or not.

Apart from the qualitative evaluation, can be used to generate a quantitative uniformity measure, specifically, the size of the maximum uniform subset of a N-sphere pointset (). The relevant presentation starts by reminding that a pointset can be considered as a mixture of a uniform subset () and a non-uniform subset (). The intra-distance distribution is a weighted average of distance distributions: (a) the (intra-)distance distribution of pairs selected from (b) the (inter-)distance distribution of pairs in which one point is selected from and one point is selected from and (c) the (intra-)distance distribution of pairs selected from , the respective weights being , and , where (note that ).

Since are selected from a uniform distribution, (it is assumed that ). Regarding the inter-distance distribution , this is also because is a sum of distance distributions in which one point of the pair is fixed on the hypersphere while the second is uniformly selected, i.e. (identical) hypersphere chord length distributions (as implied by the third distribution for which the hypersphere chord length distribution is the limit distribution). Therefore, .

Since can be straightforwardly estimated by the pointset , and (this follows by the inequality ), then a lower limit for is . Since and are infinite, then , i.e. . The ratio is the percentage of non-uniform points in , therefore, what has been proven is the following:

Proposition IV.1.

If a set of points on the N-sphere have an intra-distance distribution for which the distance from the N-sphere chord length distribution is