Testing goodness of fit for point processes via topological data analysis

06/18/2019 ∙ by Christophe Ange Napoléon Biscio, et al. ∙ ULCO, University of Mannheim, Aalborg University

We introduce tests for the goodness of fit of point patterns via methods from topological data analysis. More precisely, the persistent Betti numbers give rise to a bivariate functional summary statistic for observed point patterns that is asymptotically Gaussian in large observation windows. We analyze the power of tests derived from this statistic on simulated point patterns and compare their performance with that of global envelope tests. Finally, we apply the tests to a point pattern from an application context in neuroscience. As the main methodological contribution, we derive sufficient conditions for a functional central limit theorem for bounded persistent Betti numbers of point processes with exponential decay of correlations.


1. Introduction

Topological data analysis (TDA) provides insights into a variety of datasets by capturing their most salient properties via refined topological features. Since the mathematical field of topology specializes in describing invariants of objects independently of the choice of a precise metric, these features are robust against small perturbations or different embeddings of the object [11, 12]. Among the most classical topological invariants are the Betti numbers. Loosely speaking, they capture the number of holes of a given dimension in the investigated structure. TDA refines this idea substantially by constructing filtrations and tracing when topological features appear and disappear. In point pattern analysis, simplicial complexes are built so that they are topologically equivalent to a union of balls of a common radius centered at the data points, see the first three panels of Figure 1. As the radius increases, a sequence of simplicial complexes is then defined. Examples of such complexes are the basic Čech complex or the more elaborate alpha-complex, which is based on the Delaunay triangulation, see [18]. In that framework, 1-dimensional features correspond to loops in the simplicial complexes, while 0-dimensional features correspond to connected components. When moving up in the filtration, additional edges appear and at some point create new loops. On the other hand, more and more triangles also appear, thereby causing completely filled loops to disappear. Usually, the filtration is indexed by time, and we refer to the appearance and disappearance of features as births and deaths. We refer the reader to [18] for a detailed presentation of these concepts. The persistence diagram visualizes the time points when the features are born and die, see the bottom-right panel in Figure 1. Persistent Betti numbers count the number of events in upper-left blocks of the persistence diagram and are also illustrated in the figure.


Figure 1. Top: Realization of Poisson point process (left) and union of balls centered at the points of the process (right). Bottom: Alpha-complex corresponding to the union of balls with alive (blue) and dead (red) loops marked (left). Associated persistence diagram (right).
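As an illustration of this pipeline, the following R sketch simulates a Poisson pattern and computes the persistence diagram of its alpha-complex filtration with the TDA package (which is used again in Section 5); the intensity and the random seed are arbitrary choices for illustration.

```r
library(spatstat)  # point pattern simulation
library(TDA)       # persistence diagrams of alpha-complexes

set.seed(1)
# Poisson point pattern in the unit square; the intensity 100 is illustrative
X <- rpoispp(lambda = 100)

# persistence diagram of the alpha-complex filtration on the points:
# dimension 0 features are connected components (clusters), dimension 1 features are loops
pd <- alphaComplexDiag(cbind(X$x, X$y), maxdimension = 1)[["diagram"]]

# the filtration values reported here are alpha values, typically squared radii
plot(pd, main = "Persistence diagram")
```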

In this paper, we leverage persistent Betti numbers to derive goodness-of-fit tests for planar point processes. In this setting, the abstract general definition of persistent Betti numbers gives way to a clear geometric intuition induced by a picture of growing disks centered at the points of the pattern, all having radius r, where r corresponds to the index of the filtration. Features of dimension 0 correspond to connected components in the union of balls, interpreted as point clusters, whereas boundaries of bounded components of the complement can be considered as the loops forming the 1-dimensional features. Since the notion of clusters in the sense of connected components lies at the heart of persistent Betti numbers in degree 0, they become highly attractive as a tool to detect clustering in point patterns. Our tests are based on a novel functional central limit theorem (CLT) for the persistent Betti numbers in large domains. The present work embeds into two active streams of current research.

First, now that TDA has become widely adopted, the community is vigorously working towards putting the approach on a firm statistical foundation, paving the way for hypothesis testing. On the one hand, this encompasses large-sample Monte Carlo tests when working on a fixed domain [6, 9, 13]. Although these tests are highly flexible, the test statistics under the null hypothesis must be re-computed each time observations in a different window are tested. In large domains, this becomes time-consuming. On the other hand, there has been substantial progress towards establishing CLTs in large domains for functionals related to persistent Betti numbers [39, 40, 29, 35, 25]. However, these results are restricted to the null hypothesis of complete spatial randomness – i.e., the Poisson point process – and establish asymptotic Gaussianity on a multivariate, but not on a functional level. Our proof of a functional CLT is based on recently developed stabilization techniques for point processes with exponential decay of correlations [8]. As explained in the final section of [10], the main technical step towards a functional CLT consists of bounds on the cumulants.

Second, the introduction of global rank envelope tests has led to a new surge of research activity in goodness-of-fit tests for point processes [34]. One of the reasons for their popularity is that they rely on functional summary statistics rather than scalar quantities. Thus, they reveal a substantially more fine-grained picture of the underlying point pattern. In the overwhelming majority of cases, variants of the K-function are used as a functional summary statistic, thereby essentially capturing the relative density of point pairs at different distances. Here, the persistent Betti numbers offer an opportunity to augment the basic second-order information by more refined characteristics of the data. Still, even for classical summary statistics, rigorous limit theorems in large domains remain scarce. For instance, a functional central limit theorem for the estimated K-function is proven in detail only for the Poisson point process in [22], and an extension to α-determinantal point processes is outlined in [23].

The rest of the manuscript is organized as follows. First, in Section 2, we introduce the concepts of M-bounded persistence diagrams and M-bounded persistent Betti numbers. Next, in Section 3, we state the two main results of the paper, a CLT for the M-bounded persistence diagram and a functional CLT for the M-bounded persistent Betti numbers. In Section 4, we provide specific examples of point processes satisfying the conditions of the main results. Sections 5 and 6 explore TDA-based tests for simulated and real datasets, respectively. Finally, Section 7 summarizes the findings and points to possible avenues of future research. The proofs of the main results are deferred to Sections 8 and 9 of the appendix.

2. M-bounded persistent Betti numbers

For a locally finite point set φ ⊂ ℝ², persistent Betti numbers provide refined measures for the amount of clusters and voids on varying length scales. More precisely, we let

φ_{⊕r} := ⋃_{x ∈ φ} B_r(x)    (1)

denote the union of closed disks B_r(x) of radius r ≥ 0 centered at the points x ∈ φ. A 0-dimensional topological feature is a connected component of this union, corresponding to a cluster of points of φ, while a 1-dimensional feature can be thought of as a bounded connected component of the complement ℝ² \ φ_{⊕r}, often identified with its boundary loop, and describes a vacant area in the plane. As the disks grow, new features arise and vanish; we say that they are born and die. The persistent Betti numbers quantify this evolution of clusters and loops. Henceforth, we consider the persistence diagram only up to a fixed deterministic radius r_max > 0.
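For intuition, the union of disks in (1) can be visualized directly with spatstat's morphological dilation; the radius below is an arbitrary illustrative value.

```r
library(spatstat)

set.seed(1)
X <- rpoispp(lambda = 100)   # locally finite point set in the unit square
r <- 0.04                    # illustrative radius

# union of closed disks of radius r centered at the points of X
union_of_disks <- dilation(X, r)

plot(union_of_disks, main = "Union of disks of radius r")
plot(X, add = TRUE, pch = 20)
```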

As r approaches the critical radius for continuum percolation, long-range phenomena emerge [31]. Thus, determining whether two points are connected could require exploring large regions in space. While useful quantitative bounds on cluster sizes are known for Poisson point processes [1], for more general classes of point processes the picture remains opaque and research is currently at a very early stage [27, 7]. Recently, a central limit theorem for persistent Betti numbers has been established in the Poisson setting [29, 25], but for general point processes the long-range interactions pose a formidable obstacle towards proving a fully-fledged functional CLT.

From a more practical point of view, these long-range dependencies are of less concern. Although large features can carry interesting information, we expect that topological features of bounded size already provide a versatile tool for the statistical analysis of both simulated point patterns and real datasets. For that purpose, we concentrate on features whose spatial diameter does not exceed a large deterministic threshold M.

To define these M-bounded features, we introduce the Gilbert graph G_r(φ) on the vertex set φ. The Gilbert graph has the points of φ as vertices, and two points are connected by an edge if the distance between them is at most 2r or, equivalently, if the two disks of radius r centered at these points intersect.
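A minimal sketch of the Gilbert graph: two points are joined whenever their distance is at most 2r, equivalently whenever their disks of radius r overlap (all values are illustrative).

```r
library(spatstat)

set.seed(1)
X <- rpoispp(lambda = 100)
r <- 0.04

# adjacency matrix of the Gilbert graph: an edge whenever the inter-point distance is at most 2r
D <- pairdist(X)          # matrix of pairwise distances
A <- D <= 2 * r
diag(A) <- FALSE          # no self-loops

sum(A) / 2                # number of edges in the Gilbert graph
```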

2.1. M-bounded clusters

The 0-dimensional M-bounded features alive at time r are the connected components of the Gilbert graph G_r(φ) with diameter at most M. Starting at r = 0, all points belong to separate connected components, which merge into larger clusters as r increases. We thus say that all components are born at time 0.

To define the death time of a component, let C_r(x) denote the connected component of a point x ∈ φ in the Gilbert graph G_r(φ). The components of two points x, y ∈ φ meet at time

r_{x,y} := inf{ r ≥ 0 : C_r(x) = C_r(y) }.

Then, the death time of a component is the smallest r such that either its spatial diameter exceeds M, or it meets another component and x is lexicographically larger than y, where x and y denote the points of the two merging components whose associated disks meet at time r. This ordering determines which component dies when two of them meet.
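Ignoring the bound M for a moment, the death times of the 0-dimensional features can be read off from single-linkage clustering: two components of the union of disks merge exactly when the radius reaches half the length of the corresponding minimal spanning tree edge, and the single-linkage merge heights are precisely these edge lengths. A base-R sketch:

```r
set.seed(1)
X <- matrix(runif(200), ncol = 2)   # 100 points in the unit square

# single-linkage merge heights equal the minimal spanning tree edge lengths;
# the corresponding components of the union of disks die at half these heights
hc <- hclust(dist(X), method = "single")
death_times <- hc$height / 2

sum(death_times <= 0.05)            # number of cluster deaths up to radius 0.05 (illustrative)
```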

2.2. M-bounded loops

Next, we introduce the 1-dimensional features. At time r, these correspond to holes, i.e., bounded connected components of the vacant phase ℝ² \ φ_{⊕r}. In contrast to the clusters, there are no holes at time r = 0, so that both birth and death times need to be specified. Moreover, it needs to be defined how holes are related for different radii r.

The death time of a hole H in ℝ² \ φ_{⊕r} is the first time at which the hole is completely covered by the disks, i.e., the smallest r' ≥ r with H ⊂ φ_{⊕r'}. We identify a hole with the point of the plane that is covered last. Thus, a hole in ℝ² \ φ_{⊕r} and a hole in ℝ² \ φ_{⊕r'} with r ≤ r' are identified if their last covered points coincide.

New holes can only appear when two disks merge, which corresponds to including a new edge in the Gilbert graph G_r(φ). When a new hole is formed, this can happen in two ways: either a finite component of the vacant phase is separated from the infinite component, or an existing hole is split in two. In both cases, we define the size of the newly created hole(s) as follows: let x_1, …, x_k be the points of φ such that the disks of radius r around these points intersect the boundary of the hole H in ℝ² \ φ_{⊕r}. Then, the size of H is the diameter of the set {x_1, …, x_k}. The size remains unchanged until the next time the hole is split into smaller pieces; then the size is recomputed for both new holes. This definition ensures that the size decreases as the disks grow and only changes when a new edge is added to G_r(φ).

The birth time of a hole is the minimal r such that there is a hole in ℝ² \ φ_{⊕r} with the same last covered point and size less than M. By an M-bounded loop, we mean a loop with size less than M.

2.3. The persistence diagram

We now adapt the definition of the persistence diagram in [25] to include only M-bounded features. That is, we define the qth M-bounded persistence diagram, PD_q, as the empirical measure

PD_q := Σ_{i ∈ I_q} δ_{(b_i, d_i)},    (2)

where I_q is an index set over all M-bounded q-dimensional features that die before time r_max and (b_i, d_i) are the birth and death times of the ith feature. Then, the qth M-bounded persistent Betti numbers

β_q^{r,s} := PD_q([0, r] × (s, ∞)),    r ≤ s,

are the numbers of M-bounded features born before time r and dead after time s. When q = 0, all features are born at time 0, so that only death times are relevant. Hence, we write β_0^s instead of the more verbose β_0^{0,s}.
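In practice, β_q^{r,s} can be read off directly from a persistence diagram; the following sketch assumes the diagram format of the TDA package (columns dimension, Birth, Death) and uses a made-up toy diagram.

```r
# persistent Betti number: features of dimension q born by time r and still alive at time s
persistent_betti <- function(pd, q, r, s) {
  sum(pd[, "dimension"] == q & pd[, "Birth"] <= r & pd[, "Death"] > s)
}

# toy diagram with two clusters and one loop (birth and death values are made up)
pd <- cbind(dimension = c(0, 0, 1), Birth = c(0, 0, 0.2), Death = c(0.1, 0.4, 0.3))
persistent_betti(pd, q = 0, r = 0, s = 0.2)   # clusters born at 0 and still alive at 0.2
```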

3. Main results

Henceforth, X denotes a simple stationary point process in ℝ². We think of X as a random variable taking values in the space of locally finite subsets of ℝ², endowed with the smallest σ-algebra such that the number of points in any given Borel set becomes measurable. Throughout the manuscript, we assume that the factorial moment measures exist and are absolutely continuous. In particular, writing X(B) for the number of points of X in a Borel set B, the kth factorial moment density ρ^{(k)} is determined via the identity

E[X(B_1) ⋯ X(B_k)] = ∫_{B_1 × ⋯ × B_k} ρ^{(k)}(x_1, …, x_k) d(x_1, …, x_k)    (3)

for any pairwise disjoint bounded Borel sets B_1, …, B_k. Moreover, as we rely on the framework of [8], we also require that X exhibits exponential decay of correlations. Loosely speaking, this expresses an approximate factorization of the factorial moment densities and is made precise in Section 4 below. Many of the most prominent examples of point processes appearing in spatial statistics exhibit exponential decay of correlations [8, Section 2.2].

Our first main result is a CLT for the persistence diagram built on the restriction of the point process to a large observation window W_n. With a slight abuse of notation, we write PD_q = PD_q(X ∩ W_n). To prove the CLT, we impose an additional condition concerning moments under the reduced k-point Palm distribution ℙ^!_{x_1,…,x_k}. We recall that this distribution is determined via

E[ Σ_{(x_1,…,x_k) ∈ X^k_{≠}} f(x_1, …, x_k, X) ] = ∫_{(ℝ²)^k} E^!_{x_1,…,x_k}[f(x_1, …, x_k, X)] ρ^{(k)}(x_1, …, x_k) d(x_1, …, x_k)    (4)

for any bounded measurable f, where X^k_{≠} denotes the set of k-tuples of pairwise distinct points in X. Then, we impose the following moment condition.

  • (M) For every k, n ≥ 1 and every bounded Borel set B ⊂ ℝ², we have sup_{x_1, …, x_k ∈ ℝ²} E^!_{x_1,…,x_k}[X(B)^n] < ∞.

To state the CLT for the persistence diagram precisely, we let ⟨f, PD_q⟩ := ∫ f dPD_q denote the integral of a bounded measurable function f with respect to the measure PD_q.

Theorem 3.1 (CLT for persistence diagrams).

Let M > 0, q ∈ {0, 1}, and let f be a bounded measurable function. Assume that X exhibits exponential decay of correlations and satisfies condition (M). Furthermore, assume that Var(⟨f, PD_q⟩) ≥ c |W_n| for some c > 0 and all sufficiently large n. Then,

(⟨f, PD_q⟩ − E[⟨f, PD_q⟩]) / √Var(⟨f, PD_q⟩)

converges in distribution to a standard normal random variable as n → ∞.

In order to derive a functional CLT for the persistent Betti numbers, we add a further constraint on X, which is needed to establish a lower bound on the variance via a conditioning argument in the vein of [38, Lemma 4.3]. For this purpose, we consider a random measure Λ, which is jointly stationary with X and which we think of as capturing additional useful information on the dependence structure of X. For instance, if X is a Cox point process, we choose Λ to be the random intensity measure. If X is a Poisson cluster process, then Λ would describe the cluster centers. If the dependence structure is exceptionally simple, it is also possible to take Λ = X. The idea of using additional information is motivated by conditioning on the spatially refined information coming from the clan-of-ancestors construction in Gibbsian point processes [38].

The point process X is conditionally m-dependent if X ∩ A and X ∩ B are conditionally independent given F_Λ for any bounded Borel sets A, B such that the distance between A and B is larger than some m > 0. Here, F_Λ denotes the σ-algebra generated by the random measure Λ.

Finally, we impose an absolute continuity-type assumption on the Poisson point process in a fixed box with respect to X when conditioned on Λ and the points outside the box. More precisely, we demand that there exists a bounded window W with the following property, where P_W denotes a homogeneous Poisson point process in the window W.

  1. (AC) Let A be an event such that ℙ(P_W ∈ A) > 0. Then, with positive probability, ℙ(X ∩ W ∈ A | Λ, X ∩ (ℝ² \ W)) > 0.

Although (AC) appears technical, Section 4 illustrates that it is tractable for many commonly used point processes.

Since the persistent Betti numbers exhibit jumps at the birth- and death times of features, we work in the Skorokhod topology [5, Section 14].

Theorem 3.2 (Functional CLT for persistent Betti numbers).

Let M > 0 and let X be a conditionally m-dependent point process with exponential decay of correlations satisfying conditions (M) and (AC). Then, the following convergence statements hold true.

  1. The one-dimensional process s ↦ (β_0^s − E[β_0^s]) / √|W_n|, computed on X ∩ W_n, converges weakly in the Skorokhod topology to a centered Gaussian process as n → ∞.

  2. The two-dimensional process (r, s) ↦ (β_1^{r,s} − E[β_1^{r,s}]) / √|W_n|, computed on X ∩ W_n, converges weakly in the Skorokhod topology to a centered Gaussian process as n → ∞.

Additionally, [8, Theorem 1.12] implies convergence of the rescaled variances. While Theorem 3.1 is an adaptation of [8], Theorem 3.2 is much more delicate. As an application of Theorem 3.2, we obtain a functional CLT for the following two characteristics, which are modified variants of the accumulated persistence function from [6]:

APF_0(t) := Σ_{i ∈ I_0} (d_i − b_i) 1{d_i ≤ t}

and

APF_1(t) := Σ_{i ∈ I_1} (d_i − b_i) 1{b_i ≤ t}.

Corollary 3.3 (Functional CLT for the APF).

Let M > 0 and X be as in Theorem 3.2. Then, after centering and rescaling as in Theorem 3.2, both APF_0 and APF_1 converge weakly to centered Gaussian processes.
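A sketch of the two accumulated persistence functions, again assuming the diagram format of the TDA package; the aggregation rules (clusters accumulated by death time, loops by birth time) follow the description above.

```r
# accumulated life times of clusters (dimension 0) that have died by time t
apf0 <- function(pd, t) {
  d0 <- pd[pd[, "dimension"] == 0, , drop = FALSE]
  sum((d0[, "Death"] - d0[, "Birth"]) * (d0[, "Death"] <= t))
}

# accumulated life times of loops (dimension 1) born by time t
apf1 <- function(pd, t) {
  d1 <- pd[pd[, "dimension"] == 1, , drop = FALSE]
  sum((d1[, "Death"] - d1[, "Birth"]) * (d1[, "Birth"] <= t))
}
```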

4. Examples of point processes

In this section, we give examples of point processes which satisfy the assumptions of our main theorems. More precisely, we show that log-Gaussian Cox processes with compactly supported covariance functions and Matérn cluster processes both satisfy the conditions of Theorems 3.1 and 3.2. We also show that the Ginibre point process satisfies the conditions of Theorem 3.1.

Conversely, we do not expect that hard-core point processes satisfy the functional central limit theorem in the generality of Theorem 3.2. Indeed, hard-core conditions put a strict lower bound on the death time of clusters and the birth time of loops. However, we believe that suitable repulsive point processes, where the hard-core conditions only need to be imposed with a certain probability, can be embedded in the framework of Theorem 3.2.

We first recall the definition of exponential decay of correlations from [8]. To this end, we define the separation distance between two finite configurations x = (x_1, …, x_p) and y = (y_1, …, y_q) as in [8, Formula (1.3)] via

s(x, y) := min_{1 ≤ i ≤ p, 1 ≤ j ≤ q} |x_i − y_j|.    (5)
Definition 4.1.

Let X be a stationary point process in ℝ² such that the k-point correlation function ρ^{(k)} exists for all k ≥ 1. Then, X exhibits exponential decay of correlations if there exist constants c_k > 0, k ≥ 1, and a decreasing function φ: [0, ∞) → [0, ∞) such that

  1. sup_{x_1, …, x_k} ρ^{(k)}(x_1, …, x_k) < ∞ for all k ≥ 1,

  2. φ(t) ≤ e^{−at} for some a > 0 and all t ≥ 0,

  3. |ρ^{(p+q)}(x, y) − ρ^{(p)}(x) ρ^{(q)}(y)| ≤ c_{p+q} φ(s(x, y)) for any p, q ≥ 1, x ∈ (ℝ²)^p and y ∈ (ℝ²)^q.

4.1. Log-Gaussian Cox process

Let Y = (Y_x)_{x ∈ ℝ²} be a stationary Gaussian process with mean μ and covariance function C. Then, the random measure Λ on ℝ² defined as Λ(B) := ∫_B exp(Y_x) dx for any Borel subset B has moments of any order. Let X be a Cox process with random intensity measure Λ, referred to as a log-Gaussian Cox process. By [15, Equation (7)], the factorial moment densities of X are given by

ρ^{(k)}(x_1, …, x_k) = exp(kμ + k C(0)/2) ∏_{1 ≤ i < j ≤ k} exp(C(x_i − x_j)).

To apply Theorems 3.1 and 3.2, we assume that C is bounded and of compact support, which ensures that X exhibits exponential decay of correlations.

We show below that condition (M) is satisfied. Let k ≥ 1 and x_1, …, x_k ∈ ℝ². According to [16, Theorem 1], the log-Gaussian Cox process under the reduced Palm distribution ℙ^!_{x_1,…,x_k} is again a log-Gaussian Cox process with underlying Gaussian process Y_x + Σ_{i ≤ k} C(x − x_i). According to [17, Equation (5.4.5)],

E^!_{x_1,…,x_k}[X(B)^n] = Σ_{j ≤ n} c_{n,j} ∫_{B^j} ρ^{(j)}_{x_1,…,x_k}(y_1, …, y_j) d(y_1, …, y_j)

for suitable coefficients c_{n,j}, where ρ^{(j)}_{x_1,…,x_k} denotes the jth factorial moment density with respect to ℙ^!_{x_1,…,x_k}. Therefore, it is enough to prove that

sup_{x_1, …, x_k} sup_{y_1, …, y_j ∈ B} ρ^{(j)}_{x_1,…,x_k}(y_1, …, y_j) < ∞

for all j ≥ 1. Now, Equation (8) in [15] expresses ρ^{(j)}_{x_1,…,x_k} in terms of the shifted mean μ + Σ_{i ≤ k} C(· − x_i) and the covariance function C, where the right-hand side is bounded as C and μ are bounded independently of x_1, …, x_k. This verifies condition (M).

Since, conditionally on Λ, the point process X is a Poisson point process, the conditional m-dependence property holds for any m > 0 when taking Λ to be the random intensity measure.

It remains to verify condition (AC). By [33, Equation (6.2)], conditionally on Λ, the distribution of the point process X ∩ W admits the density with respect to a homogeneous Poisson point process with intensity 1 in W given by

f_Λ(φ) = exp(|W| − Λ(W)) ∏_{x ∈ φ} λ(x),

where λ(x) = exp(Y_x) denotes the density of Λ. In particular, f_Λ(φ) is strictly positive for all finite φ ⊂ W. Therefore, if A is an event such that ℙ(P_W ∈ A) > 0, then almost surely ℙ(X ∩ W ∈ A | Λ, X ∩ (ℝ² \ W)) > 0. This verifies condition (AC).

4.2. Matérn cluster process

Let X_p be a homogeneous Poisson point process in ℝ² with intensity κ > 0. Given a realization of X_p, we define a family of independent point processes X^x, x ∈ X_p, where X^x is a homogeneous Poisson point process with intensity 1 in the disk of radius R centered at x. The point process X := ∪_{x ∈ X_p} X^x is referred to as a Matérn cluster process. Since X is 2R-dependent, it exhibits exponential decay of correlations.

Next, we verify condition (M). For this purpose, we deduce from [16, Section 5.3.2] that a Matérn cluster process is a Cox process whose random intensity measure has as density the random field λ given by

λ(y) = Σ_{x ∈ X_p} 1{|y − x| ≤ R},   y ∈ ℝ².

Now, let n ≥ 1 and x_1, …, x_k ∈ ℝ² be fixed. From [16, Equations (19) and (20)] we obtain a representation of the reduced Palm moment E^!_{x_1,…,x_k}[X(B)^n] in terms of the random field λ. Since λ(y) is increasing in X_p for every y in the sense of [30], the Harris–FKG inequality [30, Theorem 20.4] yields a corresponding bound, where we use that the number of parent points in a disk of radius R is a Poisson random variable. In order to bound the remaining expectation, we first apply the Hölder inequality and stationarity. The first resulting factor is finite since X_p is a Poisson point process. For the remaining part, we note that the offspring points contributing to a bounded set B lie in the disks around the parent points in B ⊕ B_R(0), where ⊕ denotes the Minkowski sum. Hence, the corresponding moment can be bounded in terms of a homogeneous Poisson point process of intensity 1 in a disk [30, Theorem 3.9]. Again, since the number of points of this process is a Poisson random variable, the latter expression is finite. Taking the supremum over all x_1, …, x_k, this verifies condition (M). The point process X is also conditionally 2R-dependent, by taking Λ = X_p and m = 2R.

It remains to prove (AC). By [33, Equation (6.2)], conditional on X_p, the distribution of X ∩ W admits the density

f_{X_p}(φ) = exp(|W| − Λ(W)) ∏_{y ∈ φ} λ(y)

with respect to the distribution of a homogeneous Poisson point process in W. Now, consider the event E that the window W is entirely covered by the disks of radius R centered at the points of X_p; on the event E, the density is positive for every finite configuration φ ⊂ W. Therefore, if A is such that ℙ(P_W ∈ A) > 0, then almost surely on E,

ℙ(X ∩ W ∈ A | X_p, X ∩ (ℝ² \ W)) > 0.

Since E occurs with positive probability, this proves condition (AC).

4.3. Ginibre point process

The Ginibre point process is a determinantal point process with kernel

K(x, y) = (1/π) exp(x ȳ) exp(−(|x|² + |y|²)/2),   x, y ∈ ℂ,

with ℂ identified with ℝ². As mentioned in [8, p. 19], this point process exhibits exponential decay of correlations. According to [21, Theorem 2], for every bounded Borel set B and every n ≥ 1 we have E^!_{x_1,…,x_k}[X(B)^n] ≤ E[X(B)^n], where the right-hand side is finite by [26, Lemma 4.2.6]. Hence, we obtain an upper bound for the Palm moments which is independent of x_1, …, x_k, thereby verifying condition (M).

5. Simulation study

We elucidate in a simulation study how cluster- and loop-based test statistics derived from Theorem 3.2 can detect deviations from complete spatial randomness. The simulations are carried out on top of the R packages spatstat and TDA [20, 2].

For the entire simulation study, the null model is a Poisson point process with constant intensity in a fixed observation window. Moreover, we fix the size bound M so large that it encompasses the entire sampling window and therefore suppress its appearance in the notation. Although the proof of Theorem 3.2 relies on the M-boundedness, the simulation study illustrates that it is not critical to impose this condition when testing hypotheses on common point patterns.

5.1. Deviation tests

As a first step, we derive scalar cluster- and loop-based test statistics.

5.1.1. Definition of test statistics

As a test statistic based on clusters, we use the integral over the number of cluster deaths in a time interval [0, T] with T > 0, i.e.,

T_C := ∫_0^T #{i ∈ I_0 : d_i ≤ t} dt.    (6)

After subtracting the mean, this test statistic becomes reminiscent of the classical Cramér–von Mises statistic, except that we do not consider squared deviations. Although squaring would make it easier to detect two-sided deviations, it would also require knowledge of the quantiles of the integral of the square of a centered Gaussian process. Albeit possible, this incurs substantial computational expenses. Our simpler alternative has the appeal that, as an integral of a Gaussian process, T_C is asymptotically normal and therefore characterized by its mean and variance.
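Since each death at time d_i contributes max(T − d_i, 0) to the integral in (6), the statistic admits a simple closed form; a sketch (T is a user-chosen cutoff):

```r
# integral over [0, T] of the number of cluster deaths that have occurred by time t;
# each death at time d contributes max(T - d, 0)
cluster_statistic <- function(death_times, T) {
  sum(pmax(T - death_times, 0))
}
```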

As a test statistic based on loops, we use the accumulated persistence function, which aggregates the life times of all loops with birth times in a time interval [0, T] with T > 0, i.e.,

T_L := Σ_{i ∈ I_1} (d_i − b_i) 1{b_i ≤ T}.    (7)

By Corollary 3.3, after centering and rescaling, the statistic T_L converges in the large-volume limit to a normal random variable.

The statistics T_C and T_L are specific possibilities to define scalar characteristics from the persistence diagram. Depending on the application context, other choices, such as accumulating the life times of clusters instead of merely counting their deaths, could be useful. However, in the simulation study below we found the weighting by life times of clusters to be detrimental.

5.1.2. Exploratory Analysis

As alternatives to the Poisson null hypothesis, we consider the attractive Matérn cluster process and the repulsive Strauss process. More precisely, the Matérn cluster process features a Poisson parent process with intensity 2 and generates a random number of offspring uniformly in a disk around each parent. The Strauss process is specified by an interaction parameter and an interaction radius. The intensity parameter was tuned so as to match approximately the intensity of the null model. Figure 2 shows realizations of the null model and the alternatives.

Figure 2. Samples from the null model (left), the Matérn cluster process (center) and the Strauss process (right).
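The following spatstat sketch shows how such samples can be generated; apart from the parent intensity 2 mentioned above, all parameter values (window size, cluster radius, offspring number, Strauss parameters) are placeholders rather than the values used in the study.

```r
library(spatstat)

set.seed(1)
W <- owin(c(0, 5), c(0, 5))   # observation window (placeholder size)

# null model: homogeneous Poisson point process
X_pois <- rpoispp(lambda = 100, win = W)

# attractive alternative: Matern cluster process with parent intensity 2;
# cluster radius and mean offspring number are placeholders chosen to roughly match the intensity
X_mat <- rMatClust(kappa = 2, scale = 0.1, mu = 50, win = W)

# repulsive alternative: Strauss process; beta, gamma and R are placeholders
X_str <- rStrauss(beta = 150, gamma = 0.5, R = 0.05, W = W)
```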

In a first step, in Figure 3, we plot the persistence diagrams of samples from the null model and of the alternatives.

Figure 3. Persistence diagrams for cluster-based features with density plots (top) and loop-based features (bottom) for the null model (left), the Matérn cluster process (center) and the Strauss process (right).

From the cluster-based diagrams, it becomes apparent that, in comparison to the null model, features in the Matérn cluster process can also die at rather late times, whereas this happens very rarely in the Strauss process. When analyzing loops, we see that loops with long life times can appear earlier in the null model than in the Matérn cluster process. Conversely, while some loops with substantial life times emerge at later times in the null model, there are very few such cases in the Strauss model.

5.1.3. Mean and variance under the null model

Now, we determine the mean and variance of T_C and T_L under the null model. For this purpose, we compute the number of cluster deaths and the accumulated loop life times for 10,000 independent draws of the null model.
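A sketch of this Monte Carlo calibration for the cluster-based statistic, using the single-linkage shortcut for the death times from Section 2.1; the cutoff, intensity and number of replications are illustrative (the study uses 10,000 replications).

```r
library(spatstat)

set.seed(1)
T_cut <- 0.05    # upper limit of the integration interval (illustrative)
nsim  <- 200     # number of null replications (illustrative)

stat <- numeric(nsim)
for (i in seq_len(nsim)) {
  X <- rpoispp(lambda = 100)                                        # draw from the null model
  d <- hclust(dist(cbind(X$x, X$y)), method = "single")$height / 2  # cluster death times
  stat[i] <- sum(pmax(T_cut - d, 0))                                # integral of the death counts over [0, T_cut]
}

null_mean <- mean(stat)
null_var  <- var(stat)
```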

Comparing the mean curves for the number of cluster deaths in the null model with those of the alternatives matches up nicely with the intuition about attraction and repulsion. For late times, they all approach a common value, namely the expected number of points in the observation window. However, Figure 4 shows that for the Matérn model, the slope is far steeper for early times, caused by merging of components of points within a cluster. In contrast, for the Strauss process the increase is at first much less pronounced than in the Poisson model, thereby reflecting the repulsive nature of the Gibbs potential.

Figure 4. Mean number of cluster deaths (left) and accumulated loop life times (right) for the null model (red) and the alternatives (green and blue) based on 10,000 realizations.

For the loops, a radically different picture emerges. Here, the curve for the Strauss process lies above the accumulated loop life times of the null model. The Strauss model spawns substantially more loops than the Poisson model, although most of them live for a shorter period. Still, taken together these competing effects lead to a net increase of the accumulated loop life times in the Strauss model.

5.1.4. Type I and II errors

By Theorem 3.2, the statistics T_C and T_L are asymptotically normal, so that knowing the mean and variance allows us to construct a deviation test whose nominal confidence level is asymptotically exact. For the loops, we can choose the entire relevant time range as the interval [0, T]. For the cluster features, this choice would be unreasonable, since for late times we simply obtain the number of points in the observation window, which is not discriminative. Hence, we set T to a smaller cutoff. We stress that in situations with no a priori knowledge of a good choice of T, the test power can degrade substantially.
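Given the calibrated null mean and variance, the asymptotic normality from Theorem 3.2 turns the deviation test into a simple z-test; the numbers below are placeholders for the calibrated values and the observed statistic.

```r
# two-sided deviation test at the 5% level based on asymptotic normality
null_mean <- 1.2   # placeholder: Monte Carlo mean under the null model
null_var  <- 0.04  # placeholder: Monte Carlo variance under the null model
obs       <- 1.6   # placeholder: statistic evaluated on the observed pattern

z <- (obs - null_mean) / sqrt(null_var)
reject <- abs(z) > qnorm(0.975)
reject
```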

To analyze the type I and II errors, we draw 1,000 realizations from the null model and from each of the alternatives. Table 1 shows the rejection rates of this test setup. Under the null model, the rejection rates are close to the nominal 5% level, thereby illustrating that already for moderately large point patterns the approximation by the Gaussian limit is accurate. Using the mean and variance from the null model, we now compute the test powers for the alternatives. The cluster-based statistic T_C already leads to a test power of approximately 60% for both alternatives. When considering T_L, we obtain a type I error rate of 4.8%, so that the confidence level is kept. Moreover, the power analysis reveals that in the present simulation set-up, T_L is better at detecting deviations from the null hypothesis than T_C.

       Poisson (null)   Matérn cluster   Strauss
T_C        5.1%             59.3%         60.7%
T_L        4.8%             94.7%         71.4%

Table 1. Rejection rates for the test statistics T_C (cluster-based) and T_L (loop-based) under the null model and the two alternatives.