Persistence Weighted Gaussian Kernel for Probability Distributions on the Space of Persistence Diagrams

03/22/2018 ∙ by Genki Kusano, et al. ∙ Tohoku University

A persistence diagram characterizes robust geometric and topological features in data. The data treated here are assumed to be drawn from a probability distribution, so the corresponding persistence diagrams are random. This paper reveals relationships between distributions and persistence diagrams from the viewpoint of (1) the strong law of large numbers and the central limit theorem, (2) confidence intervals, and (3) stability properties, via the persistence weighted Gaussian kernel, which is a statistical method for persistence diagrams. In numerical experiments for distributions, our method is compared against other statistical methods for persistence diagrams.


1 Introduction

Topological data analysis (TDA) is a research field that utilizes topological properties in data analysis, and it has been applied successfully in various research fields, for example, material science [Nakamura et al., 2015, Hiraoka et al., 2016, Saadatfar et al., 2017], biochemistry [Cang et al., 2015, Gameiro et al., 2015], information science [Hiraoka and Kusano, 2016, de Silva and Ghrist, 2007], fluid dynamics [Kramár et al., 2016], computer vision [Skraba et al., 2010], and so on. One of the key tools in TDA is persistent homology, which was first introduced in [Edelsbrunner et al., 2002] to describe robust topological features in data.

To explain persistent homology intuitively, let us consider the case where the input data is given as a point set. This situation appears in many applications in TDA. For example, by representing each atom in some material by its $3$-dimensional coordinates, the atomic configuration of the material is represented as a point set in $\mathbb{R}^3$. As another example, let $P$ be a probability measure on a measurable space $M$; then observations $x_1,\ldots,x_n$ drawn from $P$ are also expressed as a point set $X=\{x_1,\ldots,x_n\}$. For a point set $X$, we consider a union of balls $X_r=\bigcup_{x\in X}B(x;r)$ with radius $r$, where $B(x;r)$ is the closed ball of radius $r$ centered at $x$, and the homology group $H_q(X_r)$ in dimension $q$ with field coefficient. Let $u_r^s: H_q(X_r)\to H_q(X_s)$ denote the map induced by the inclusion $X_r\hookrightarrow X_s$ for $r\le s$. Here, we view a nonzero homology class $\alpha\in H_q(X_r)$ as representing a $q$-dimensional topological hole in $X_r$ and call it a generator of the persistent homology. If $u_r^s(\alpha)$ is not zero, then it is interpreted that the generator $\alpha$ exists at radius $r$ and still persists at radius $s$. The minimum and maximum of the radii at which a generator persists are called the birth time and the death time, respectively. The collection of all birth-death pairs of generators is a multiset of points in $\mathbb{R}^2$ and is called the persistence diagram (Figure 1). For a birth-death pair $x=(b,d)$ in a persistence diagram, its lifetime $d-b$ is called the persistence of $x$. As in Figure 1, a generator with large (resp. small) persistence is interpreted to represent a large (resp. small) topological hole. Hence, in many applications, generators with large persistence are viewed as important features in the input data, and generators with small persistence are treated as topological noise.

Figure 1: Unions of balls with several radii (left) and the persistence diagram in dimension (right). The persistence diagram encodes the birth and death time of each generator.

In applying a persistence diagram to various problems, it is desirable to analyze persistence diagrams statistically. However, the definition of a persistence diagram as a multiset of points in $\mathbb{R}^2$ is inappropriate for directly considering its statistical properties. While standard statistical methods assume that the input data lies in a space with an inner product structure, a natural inner product for the space of persistence diagrams has not been proposed. This gap is an obstacle to developing statistical methods for persistence diagrams. In order to avoid this obstacle, many studies in statistical TDA transform a persistence diagram to an element of a Hilbert space [Adams et al., 2017, Bubenik, 2015, Cang et al., 2015, Carriere et al., 2017, Donatini et al., 1998, Kusano et al., 2018, Reininghaus et al., 2015, Robins and Turner, 2016]. In this paper, we focus on the persistence weighted kernel (PWK) [Kusano et al., 2016, Kusano et al., 2018], which is a recently developed statistical method for persistence diagrams. (Footnote 1: This was originally called the persistence weighted Gaussian kernel in [Kusano et al., 2016, Kusano et al., 2018] because we mainly focused on the Gaussian kernel as the positive definite kernel, but the framework can be generalized to other positive definite kernels. Hence, we drop the word “Gaussian” here.) The PWK is composed of a positive definite kernel $k$ and a weight function $w$. While a one-variable function $k(\cdot,x)$ is a function on the domain of $k$, it is shown that $k(\cdot,x)$ is an element of a Hilbert space $\mathcal{H}_k$, which is called the reproducing kernel Hilbert space (RKHS) of $k$. Here, we transform a persistence diagram $D$ to a weighted sum $V^{k,w}(D)=\sum_{x\in D}w(x)k(\cdot,x)$ and call it the PWK vector.

Contributions

Under an appropriate condition, a persistence diagram $D$ can be viewed as a sample drawn from some probability distribution $P$. Then, the PWK vector $V^{k,w}(D)$ is a random variable taking values in the RKHS, and the expectation $\mathbb{E}_{D\sim P}[V^{k,w}(D)]$ is well-defined. Contributions in this paper are summarized as follows:

(1) Let $D_1,\ldots,D_n$ be i.i.d. samples drawn from $P$ and $\frac{1}{n}\sum_{i=1}^{n}V^{k,w}(D_i)$ be the sample mean of their PWK vectors. In order to understand probabilistic properties of the PWK vector, we will describe the convergence of the sample mean to the expectation $\mathbb{E}_{D\sim P}[V^{k,w}(D)]$ by showing the strong law of large numbers (Theorem 3.3) and the central limit theorem for PWK vectors (Theorem 3.4).

(2) From the strong law of large numbers for PWK vectors, the sample mean converges almost surely to the expectation as $n$ goes to $\infty$. In practice, the number of samples is finite. We are therefore interested in how close the sample mean for a fixed number $n$ is to the expectation. We formalize this question in the context of a confidence interval. A confidence interval for a parameter is an interval which contains the parameter with high probability. Since a PWK vector is a real-valued function on the domain of $k$, by regarding the expectation as a real-valued function on the same domain, we construct the confidence interval for the expectation (Theorem 4.8).

(3) When the sample mean is viewed as the expectation with respect to the empirical distribution of $D_1,\ldots,D_n$, the difference between the two expectations of a (true) distribution and its empirical distribution is estimated to be small from the strong law of large numbers. Then, these probability distributions are also considered to be close with respect to an appropriate distance on probability distributions when $n$ is large. We will generalize this by showing that the map from a probability distribution of persistence diagrams to the expectation of a PWK vector is Lipschitz continuous (Theorem 5.5), which is the stability of the expectation of a PWK vector.

Related work

An early study on vectorizing a persistence diagram is the persistence landscape [Bubenik, 2015], which is a function constructed from a persistence diagram, and [Bubenik, 2015] shows the strong law of large numbers and the central limit theorem for a persistence landscape. On related topics, there are previous works on a uniform confidence band for a persistence landscape [Chazal et al., 2013, Chazal et al., 2014b] and on the stability of the expectation of a persistence landscape [Chazal et al., 2015]. We remark that we have used several techniques from these papers to show our results. On the other hand, a PWK vector can flexibly measure the size of a generator by a weight function $w$, while several statistical methods for persistence diagrams, including persistence landscapes, always return a large (resp. small) value for a generator with large (resp. small) persistence.

In numerical experiments, we compare the PWK vector and other statistical methods for persistence diagrams: the persistence landscape [Bubenik, 2015], the persistence scale-space kernel [Reininghaus et al., 2015], and the Sliced Wasserstein kernel [Carriere et al., 2017]. While the persistence landscape and the Sliced Wasserstein kernel are parameter-free statistical methods, a PWK vector requires selecting a positive definite kernel $k$ and a weight function $w$ appropriately. In other words, depending on the dataset, we can select $k$ and $w$ flexibly so that the PWK shows higher performance than other methods in statistical and machine learning tasks. One of the datasets which we will treat in this paper is related to material science, and it has important topological features whose persistence is small. Utilizing the weight factor in a PWK vector, we will show the advantage of our statistical method.

Organization

This paper is organized as follows: We review some basics on persistence diagrams and a PWK vector in Section 2. In Section 3, we explain how to regard the PWK vector as a random variable taking values in the RKHS and show the limit theorems. We then construct a confidence interval for the expectation of a PWK vector in Section 4 and show the stability of the expectation of a PWK vector in Section 5. In Section 6, we show numerical results obtained with the PWK vector and compare its performance with other statistical methods for persistence diagrams.

2 Preliminaries

2.1 Persistence Diagram

In this section, we briefly review the basics of persistent homology and persistence diagrams. Throughout this section, we fix a field $K$. We refer the reader to [Hatcher, 2002] for homology groups and [Zomorodian and Carlsson, 2005] for persistence diagrams.

Let $\mathbb{X}=\{X_a\}_{a\in\mathbb{R}}$ be a family of topological spaces. If $X_a$ is a subspace of $X_b$ for any $a\le b$, $\mathbb{X}$ is called a filtration of topological spaces. Then, the family $\{H_q(X_a)\}_{a\in\mathbb{R}}$ of the $q$-th homology groups with coefficient $K$ is a family of $K$-vector spaces. Since the inclusion $X_a\hookrightarrow X_b$ induces the $K$-linear map $u_a^b: H_q(X_a)\to H_q(X_b)$, there exists a family $\{u_a^b\}_{a\le b}$ of $K$-linear maps. Then, the collection of $\{H_q(X_a)\}_{a\in\mathbb{R}}$ and $\{u_a^b\}_{a\le b}$ is called the $q$-th persistent homology of $\mathbb{X}$ and denoted by $H_q(\mathbb{X})$.

Persistent homology is generalized as follows: Let $\{U_a\}_{a\in\mathbb{R}}$ be a family of $K$-vector spaces and $\{u_a^b: U_a\to U_b\}_{a\le b}$ be a family of $K$-linear maps. If $u_a^a$ is the identity map on $U_a$ and $u_a^c=u_b^c\circ u_a^b$ for any $a\le b\le c$, the collection $\mathbb{U}$ of these $K$-vector spaces and $K$-linear maps is called a persistence module. For an interval $J\subset\mathbb{R}$ (Footnote 2: A subset $J\subset\mathbb{R}$ is said to be an interval if, for any $a,c\in J$, every $b$ satisfying $a\le b\le c$ is in $J$.), the persistence module $\mathbb{I}_J$ is called the interval persistence module over $J$ if $U_a=K$ for $a\in J$ and $U_a=0$ otherwise, and $u_a^b$ is the identity map for any $a\le b$ in $J$ and the zero map otherwise. It is known that an appropriate persistence module $\mathbb{U}$ is decomposed into interval persistence modules as $\mathbb{U}\cong\bigoplus_{i\in I}\mathbb{I}_{J_i}$, where $I$ is an index set and each $J_i$ is an interval of $\mathbb{R}$. We refer the reader to [Bubenik and Scott, 2014, Carlsson and de Silva, 2010, Crawley-Boevey, 2015, Zomorodian and Carlsson, 2005] for the details about the appropriate condition and the symbols $\cong$ and $\bigoplus$. We remark that all persistence modules which will be used in this paper satisfy the decomposition condition.

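For an interval $J$, we define the birth time and death time of $J$ by the endpoints $b_J=\inf J$ and $d_J=\sup J$, respectively. The lifetime $d_J-b_J$ is called the persistence of the birth-death pair $(b_J,d_J)$ and denoted by $\mathrm{pers}(b_J,d_J)$. For an interval decomposable persistence module $\mathbb{U}\cong\bigoplus_{i\in I}\mathbb{I}_{J_i}$, the persistence diagram of $\mathbb{U}$ is defined as the multiset composed of all birth-death pairs in the decomposition and denoted by

$$D(\mathbb{U})=\{(b_{J_i},d_{J_i}) \mid i\in I\}.$$

(Footnote 3: A multiset is a set with multiplicity of each point. Note that the collection of birth-death pairs should be a multiset because an interval decomposition of $\mathbb{U}$ can contain several intervals with the same birth-death pairs.)

For the persistent homology $H_q(\mathbb{X})$ of a filtration $\mathbb{X}$, the persistence diagram is denoted by $D_q(\mathbb{X})$ for short. For a finite point set $X$ in a metric space $(M,d_M)$, we define the ball model filtration $\mathbb{B}(X)=\{X_a\}_{a\in\mathbb{R}}$, where $X_a=\bigcup_{x\in X}B(x;a)$ for $a\ge 0$ and $X_a=\emptyset$ otherwise. Figure 1 shows an example of the persistence diagram of a ball model filtration.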
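As a minimal sketch of the ball model filtration in dimension 0, the following computes the finite birth-death pairs of connected components for a point set in Euclidean space. It assumes closed Euclidean balls, so two balls of radius $a$ meet once $a$ is at least half the distance between their centers. The function name `zero_dim_persistence` and the union-find approach are illustrative choices, not taken from the paper.

```python
import math
from itertools import combinations


def zero_dim_persistence(points):
    """Finite 0-dimensional birth-death pairs under the ball model.

    Assumption: Euclidean points and closed balls, so two components
    merge at half the length of the shortest edge connecting them.
    Every component is born at radius 0; the component that survives
    forever (infinite death time) is omitted.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process pairwise distances in increasing order (Kruskal-style).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    diagram = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            diagram.append((0.0, length / 2.0))
    return diagram


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
diagram = zero_dim_persistence(points)
print(diagram)
```

Higher-dimensional diagrams need a full persistent homology computation, which this sketch does not attempt.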

Since each death time is greater than or equal to the corresponding birth time, all birth-death pairs of lie on the region above the diagonal .444Precisely speaking, the definition of the birth and death time can contain and . However, in practical, we can assume that all birth and death times take neither nor (for more details, please see Section 2.1.2 in [Kusano et al., 2018]). For this reason, we define a generalized persistence diagram by a countable multiset of points in , and the set of all generalized persistence diagrams is denoted by .

Let $\Delta=\{(a,a) \mid a\in\mathbb{R}\}$ denote the diagonal set with all points having infinite multiplicity. As a distance between persistence diagrams $D$ and $E$, we use the bottleneck distance, which is defined by

$$d_B(D,E)=\inf_{\gamma}\sup_{x\in D\cup\Delta}\|x-\gamma(x)\|_\infty,$$

where $\gamma: D\cup\Delta\to E\cup\Delta$ is a multi-bijection. (Footnote 5: By considering infinite multiplicity of the diagonal set $\Delta$, there always exists a multi-bijection from $D\cup\Delta$ to $E\cup\Delta$.) The bottleneck distance is also called the $\infty$-Wasserstein distance. When we define the set of persistence diagram as , then becomes a pseudometric space.
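For very small diagrams, the bottleneck distance can be evaluated by brute force, which may help in checking the definition by hand. The sketch below assumes finite diagrams given as lists of (birth, death) pairs and enumerates all matchings after adding diagonal slots to each side. It is exponential in the diagram size and is meant only as an illustration; `bottleneck_bruteforce` is a hypothetical helper name.

```python
from itertools import permutations


def linf(p, q):
    """L-infinity distance between two points of the plane."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def diag_dist(p):
    """L-infinity distance from (b, d) to the nearest diagonal point."""
    return (p[1] - p[0]) / 2.0


def bottleneck_bruteforce(D, E):
    """Bottleneck distance by enumerating all matchings (tiny inputs only)."""
    n, m = len(D), len(E)
    size = n + m
    if size == 0:
        return 0.0

    # Rows: points of D followed by m diagonal slots.
    # Columns: points of E followed by n diagonal slots.
    def cost(i, j):
        if i < n and j < m:
            return linf(D[i], E[j])
        if i < n:
            return diag_dist(D[i])   # point of D sent to the diagonal
        if j < m:
            return diag_dist(E[j])   # point of E sent to the diagonal
        return 0.0                   # diagonal matched with diagonal

    return min(
        max(cost(i, perm[i]) for i in range(size))
        for perm in permutations(range(size))
    )


D = [(0.0, 4.0)]
E = [(0.0, 5.0), (1.0, 1.5)]
print(bottleneck_bruteforce(D, E))
```

Here the best matching pairs (0, 4) with (0, 5) and sends the short-lived point (1, 1.5) to the diagonal, which illustrates why points near the diagonal contribute little to the distance.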

For persistence diagrams $D$ and $E$, we define an equivalence relation $D\sim E$ if their bottleneck distance $d_B(D,E)$ is zero; the difference within an equivalence class is then only on the diagonal set, which can be ignored in data analysis. Thus, we abuse notation and let the same symbol also denote the metric space of equivalence classes under $\sim$.

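From now on, we fix a subspace of the space of persistence diagrams. In order to develop statistical analysis for persistence diagrams, it is convenient to focus not on the whole space but on such a subspace. For example, when we analyze finite point sets in a metric space $(M,d_M)$ and compute their persistence diagrams via the ball model filtration, all resulting persistence diagrams lie in the set

$$\{D_q(\mathbb{B}(X)) \mid X \text{ is a finite point set in } M\},$$

which is a subspace of the space of persistence diagrams. The set of finite point sets in $M$ is a metric space with the Hausdorff distance, which is defined by

$$d_H(X,Y)=\max\Big\{\sup_{x\in X}\inf_{y\in Y}d_M(x,y),\ \sup_{y\in Y}\inf_{x\in X}d_M(x,y)\Big\}.$$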

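As an important property of a persistence diagram, it is shown that the map sending a finite point set $X$ to the persistence diagram $D_q(\mathbb{B}(X))$ of its ball model filtration is Lipschitz continuous, which is known as the stability of persistence diagrams:

Theorem 2.1 ([Chazal et al., 2014a, Cohen-Steiner et al., 2007]).

For any two finite point sets $X$ and $Y$ in a metric space $(M,d_M)$, we have

$$d_B\big(D_q(\mathbb{B}(X)),D_q(\mathbb{B}(Y))\big)\le d_H(X,Y),$$

where $d_B$ is the bottleneck distance, $d_H$ is the Hausdorff distance, and $\mathbb{B}(\cdot)$ denotes the ball model filtration.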
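The Hausdorff distance on the right-hand side of the stability bound is straightforward to evaluate for finite point sets. The following sketch assumes Euclidean points; the function name `hausdorff` is an illustrative choice.

```python
import math


def hausdorff(X, Y):
    """Hausdorff distance between two finite, nonempty Euclidean point sets."""

    def directed(A, B):
        # Largest distance from a point of A to its nearest point in B.
        return max(min(math.dist(a, b) for b in B) for a in A)

    return max(directed(X, Y), directed(Y, X))


X = [(0.0, 0.0), (1.0, 0.0)]
Y = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]
print(hausdorff(X, Y))
```

In this example the single extra point (1, 2) lies at distance 2 from X, so by the stability theorem it can move the persistence diagram by at most 2 in bottleneck distance.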

2.2 Persistence Weighted Kernel

In this section, we review the basics of a positive definite kernel and the PWK vector. We refer the reader to [Berlinet and Thomas-Agnan, 2011, Paulsen and Raghupathi, 2016] for the kernel method, which is the statistical theory of a positive definite kernel, and [Kusano et al., 2018] for the PWK vector.

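Let $\mathcal{X}$ be a set. A function $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is called a positive definite kernel on $\mathcal{X}$ if $k$ is symmetric, i.e., $k(x,y)=k(y,x)$ for any $x,y\in\mathcal{X}$, and the matrix $(k(x_i,x_j))_{i,j=1}^{n}$ is a nonnegative definite matrix for any finite number of points $x_1,\ldots,x_n\in\mathcal{X}$. It is known from the Moore-Aronszajn theorem that a positive definite kernel $k$ uniquely defines a Hilbert space $\mathcal{H}_k$ as a subspace of a real-valued functional space on $\mathcal{X}$, which is called the reproducing kernel Hilbert space (RKHS).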
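The definition can be probed numerically: for a positive definite kernel, every quadratic form $\sum_{i,j}c_ic_jk(x_i,x_j)$ built from a Gram matrix must be nonnegative. The sketch below does this for the Gaussian kernel $k(x,y)=\exp(-\|x-y\|^2/(2\sigma^2))$ on random points. It is a spot check on random coefficient vectors, not a proof, and the bandwidth and sample sizes are arbitrary illustrative values.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


def gram_matrix(points, kernel):
    """Matrix of kernel values k(x_i, x_j)."""
    return [[kernel(x, y) for y in points] for x in points]


def quadratic_form(gram, c):
    """Evaluate sum_{i,j} c_i c_j gram[i][j]."""
    n = len(c)
    return sum(c[i] * c[j] * gram[i][j] for i in range(n) for j in range(n))


rng = random.Random(0)
# Random points with birth in [0, 1] and death in [1, 2] (illustrative).
points = [(rng.uniform(0, 1), rng.uniform(1, 2)) for _ in range(6)]
gram = gram_matrix(points, gaussian_kernel)

# Nonnegative definiteness implies every such value is >= 0
# (up to floating-point rounding).
values = [
    quadratic_form(gram, [rng.gauss(0, 1) for _ in points])
    for _ in range(200)
]
print(min(values) >= -1e-9)
```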

Theorem 2.2 (Moore-Aronszajn, Theorem 2.14 in [Paulsen and Raghupathi, 2016]).

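Let $\mathcal{X}$ be a set and $k$ be a positive definite kernel on $\mathcal{X}$. Then, there uniquely exists a reproducing kernel Hilbert space $\mathcal{H}_k$ satisfying the following:

(1) $k(\cdot,x)\in\mathcal{H}_k$ for any $x\in\mathcal{X}$,

(2) $\mathrm{span}\{k(\cdot,x) \mid x\in\mathcal{X}\}$ is dense in $\mathcal{H}_k$ with the uniform norm $\|\cdot\|_\infty$,

(3) $\langle f,k(\cdot,x)\rangle_{\mathcal{H}_k}=f(x)$ for any $x\in\mathcal{X}$ and any $f\in\mathcal{H}_k$.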

We remark that the inner product of and is given by from the property (3) in Theorem 2.2, and the norm of is given by . For later use, we remark that the following inequality holds for any :

(1)

We call a positive definite kernel on bounded if there exists a constant such that , and Lipschitz continuous if there exists a constant such that

The following proposition is not used here, but will be used in Section 3.

Proposition 2.3.

A reproducing kernel Hilbert space of a bounded Lipschitz continuous positive definite kernel on is separable.

Proof.

For any , we have

When we use a metric derived from the uniform norm as a metric of , the map is Lipschitz continuous from the above inequality. Since is separable, is separable from the Lipschitz continuity of , and hence is also separable because is dense in with the uniform norm from Theorem 2.2. ∎

Let be a subspace in . We define a weight function for if it satisfies for any and there exists a constant such that . If is a measurable bounded positive definite kernel on and is a weight function for , for a persistence diagram , the weighted sum

is an element in the RKHS . We call the persistence weighted kernel vector (PWK vector) [Kusano et al., 2016, Kusano et al., 2018] of by and . We remark that the inner product is given by

(2)

Furthermore, for any , we have

(3)

and thus the norm of the PWK vector is always finite.
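To make the PWK vector concrete, the sketch below evaluates inner products and RKHS distances of PWK vectors using only kernel evaluations, based on the weighted sum $\sum_{x\in D}w(x)k(\cdot,x)$ and the reproducing property. It assumes the Gaussian kernel and an arctangent-type weight of the form $w(x)=\arctan(C\,\mathrm{pers}(x)^p)$, which is our reading of the weight used in the cited PWK papers; the parameter values `C`, `p`, `sigma` and all function names are illustrative.

```python
import math


def pers(x):
    """Persistence (lifetime) of a birth-death pair."""
    return x[1] - x[0]


def w_arc(x, C=1.0, p=1.0):
    """Arctangent-type weight; the form and parameters are assumptions."""
    return math.atan(C * pers(x) ** p)


def gaussian_kernel(x, y, sigma=1.0):
    sq = (x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2
    return math.exp(-sq / (2.0 * sigma ** 2))


def pwk_evaluate(D, z, k=gaussian_kernel, w=w_arc):
    """Value at z of the PWK vector of D, viewed as a function."""
    return sum(w(x) * k(z, x) for x in D)


def pwk_inner(D, E, k=gaussian_kernel, w=w_arc):
    """RKHS inner product of the PWK vectors of D and E."""
    return sum(w(x) * w(y) * k(x, y) for x in D for y in E)


def pwk_distance(D, E, k=gaussian_kernel, w=w_arc):
    """RKHS distance between the PWK vectors of D and E."""
    sq = pwk_inner(D, D, k, w) + pwk_inner(E, E, k, w) - 2.0 * pwk_inner(D, E, k, w)
    return math.sqrt(max(sq, 0.0))  # guard against rounding below zero


D = [(0.0, 1.0), (0.2, 0.5)]
E = [(0.1, 1.2)]
print(pwk_inner(D, E), pwk_distance(D, E))
```

Because the Gaussian kernel is bounded by 1, the squared norm `pwk_inner(D, D)` is at most the square of the total weight of `D`, which is one way to see that the norm of a PWK vector is finite whenever the total weight is.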

Here, we define classes for positive definite kernels and weight functions as follows:

Let be a weight function for . The class of weight functions such that there exists a constant satisfying

(4)

for any and any multi-bijection is denoted by

Then, the stability theorem for a PWK vector is shown as follows:

Theorem 2.4 ([Kusano et al., 2018]).

Let and . Then, for any persistence diagrams , we have

It is also shown in [Kusano et al., 2018] that the Gaussian kernel is in and the arctangent type weight function is in where is a triangulable compact subspace in and . In other words, for finite point sets and in a triangulable compact subspace , if , then the map is shown to be Lipschitz continuous from Theorem 2.4.

3 Limit theorems

In this section, we will regard a persistence diagram as a random element taking values in a metric space and consider the expectation of the PWK vector in the RKHS. Before discussing probabilistic properties of a persistence diagram, we briefly review the basics of probability theory in a metric space, following [Durrett, 2010, Van der Vaart, 1998], and probability theory in a Banach space, following [Ledoux and Talagrand, 2013]. Throughout this section, we fix a probability space $(\Omega,\mathcal{F},\Pr)$.

3.1 Probability in a metric space

When is a measurable space, a measurable map is called an -valued random element. If is a topological space and is defined by the Borel -set of , an -valued random element is called a Borel random element. An -valued random element induces a probability measure from as , and the probability measure is called the probability distribution of . Then, we say that is drawn from and write . When -valued random elements are drawn from the same probability distribution and are independent, we say that are independently identically distributed from and this is denoted by . Let be a measurable map, then the integral is called the expectation of and denoted by . When we emphasize the probability distribution of , the expectation is also denoted by . An -valued random element is simply called a random variable. We consider an expectation for a map in Section 4, which may not be a random variable, and here define the outer expectation of a map by

Here, we define stochastic convergences for a sequence of maps in a metric space . Let be a sequence of maps and be an -valued random element. Then, definitions of standard convergences for random variables are generalized to ones for maps taking values in a metric space as follows:

  • If there exists a sequence of random variables satisfying for any and , then is said to converge to almost surely and it is denoted by .

  • If for all -Lipschitz function , then is said to converge in distribution to and it is denoted by .

3.2 Probability for a Banach space

Let be a Banach space and be a -valued Borel random element. If there exists an element satisfying for any , where be the topological dual space666 is the set of all continuous linear real-valued functions . of , the element is called the Pettis integral of and denoted by . If is Radon, the expectation always exists and it satisfies .777We call Radon if, for any , there exists a compact set in the Borel -set of such that . For more details, please see Section 2.1 in [Ledoux and Talagrand, 2013]. A -valued Radon random element is called a centered Gaussian random element if is a real valued Gaussian random variable with mean zero for any . A centered Gaussian random element is determined by its covariance structure which is defined by .

For -valued random elements , the sum is denoted by . The strong law of large numbers and the central limit theorem are stated as follows:

Theorem 3.1 (Strong law of large numbers, Theorem 10.5 in [Ledoux and Talagrand, 2013]).

Let be a separable Banach space, be a -valued Radon random element, and be a sequence of independent -valued random elements distributed as . Then, and if and only if .

Theorem 3.2 (Central limit theorem, Corollary 10.9 in [Ledoux and Talagrand, 2013]).

Let be a separable Banach space of type 2,888We do not define the concept of type 2 in this paper because a Hilbert space is of type 2 and a Banach space which will be used in this paper is a reproducing kernel Hilbert space. For more details, please see Section 9.2 in [Ledoux and Talagrand, 2013]. be a -valued Radon random element, and be a sequence of independent -valued random elements distributed as . If and , where is a centered Gaussian random variable with the same covariance structure as .

3.3 Limit theorems for the PWK vectors

Let be a subspace in and be a -valued Radon random element. If is continuous, the PWK vector of a -valued Radon random element can be viewed as an -valued Radon random element, and the expectation of the PWK vector is well defined in the RKHS . From now on, we assume that a positive definite kernel and a weight function are selected to make continuous. In fact, Theorem 2.4 ensures that there exist and such that is continuous.

Let be a Lipschitz continuous positive definite kernel on . Then, the RKHS is separable from Proposition 2.3, which satisfies the assumptions in the strong law of large numbers and the central limit theorem for a Banach space (Theorem 3.1 and Theorem 3.2). For any measurable bounded positive definite kernel and a weight function for , which are needed to define the PWK vector, we have confirmed in Equation (3) that the norm is always bounded from above independent of . For this reason, it is always satisfied that and . For -valued random element , denotes the sum . Applying Theorem 3.1 and Theorem 3.2 to , we obtain the following:

Theorem 3.3 (Strong law of large numbers for the PWK vector).

Let , be a weight function for , be a -valued Radon random element, and be a sequence of independent -valued random elements distributed as . Then, we have

Theorem 3.4 (Central limit theorem for the PWK vector).

Let , be a weight function for , be a -valued Radon random element, and be a sequence of independent -valued random elements distributed as . Then, we have

where is the centered Gaussian random variable with the same covariance structure as .
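The strong law of large numbers can be illustrated by simulation: the RKHS distance between the sample means of two independent samples of PWK vectors should tend to shrink as the sample size grows. The sketch below uses synthetic random diagrams (uniform births, exponential lifetimes) instead of diagrams computed from data, and an assumed Gaussian kernel with an arctangent-type weight; all of these choices are hypothetical and serve only to make the example self-contained.

```python
import math
import random


def weight(x):
    # Assumed arctangent-type weight of the persistence d - b.
    return math.atan(x[1] - x[0])


def kernel(x, y, sigma=1.0):
    sq = (x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2
    return math.exp(-sq / (2.0 * sigma ** 2))


def pwk_inner(D, E):
    return sum(weight(x) * weight(y) * kernel(x, y) for x in D for y in E)


def random_diagram(rng):
    """Synthetic diagram: 1 to 4 points (illustrative distribution)."""
    diagram = []
    for _ in range(rng.randint(1, 4)):
        b = rng.uniform(0.0, 1.0)
        diagram.append((b, b + rng.expovariate(2.0)))
    return diagram


def mean_inner(sample_a, sample_b):
    """Inner product of the two sample means of PWK vectors."""
    total = sum(pwk_inner(D, E) for D in sample_a for E in sample_b)
    return total / (len(sample_a) * len(sample_b))


def mean_distance(sample_a, sample_b):
    """RKHS distance between the two sample means."""
    sq = (mean_inner(sample_a, sample_a) + mean_inner(sample_b, sample_b)
          - 2.0 * mean_inner(sample_a, sample_b))
    return math.sqrt(max(sq, 0.0))


rng = random.Random(0)
results = {}
for n in (5, 20, 80):
    a = [random_diagram(rng) for _ in range(n)]
    b = [random_diagram(rng) for _ in range(n)]
    results[n] = mean_distance(a, b)
    print(n, results[n])
```

The central limit theorem suggests that these distances decay roughly at the rate $1/\sqrt{n}$, although a single run like this one only hints at the trend.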

4 Confidence interval

Since the RKHS is a subspace of the real-valued functional space on , the PWK vector is a real-valued function on by

Let be a Radon probability measure on and , then the expectation is a real-valued function on , and its value at is denoted by . From the strong law of large numbers for the PWK vector (Theorem 3.3), we have confirmed that

where . Furthermore, the uniform convergence

follows from Equation (1). Since the i.i.d. number is finite in practice, to measure how the sample mean is close to the expectation , we will estimate a number satisfying the following inequality:

(5)

This estimation is based on the concept of a uniform confidence band for a stochastic process when we see as a stochastic process. We first review a uniform confidence band for a stochastic process, following [Kosorok, 2008, Van der Vaart, 1998], and then construct the uniform confidence band for the expectation of a PWK vector.

4.1 Review of a uniform confidence band

Let be a measurable space, be an -valued random element, be the distribution of , be a real valued parameter attached to , and . For simplicity, the collection of the random elements is denoted by . In the concept of interval estimation, we construct an interval made from which contains with high probability. To be precise, for and two statistics and satisfying , an interval is called a confidence interval for at level if it satisfies

Example 4.1.

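Let $X_1,\ldots,X_n$ be i.i.d. samples from $N(\mu,\sigma^2)$ (Footnote 9: $N(\mu,\sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$.), where $\mu$ is unknown and $\sigma^2$ is known, and let $\bar{X}_n=\frac{1}{n}\sum_{i=1}^{n}X_i$ be a statistic. Since $\bar{X}_n\sim N(\mu,\sigma^2/n)$, we have $\sqrt{n}(\bar{X}_n-\mu)/\sigma\sim N(0,1)$. Then, for the upper $\alpha/2$-quantile $z_{\alpha/2}$ satisfying $\Pr(Z>z_{\alpha/2})=\alpha/2$ for $Z\sim N(0,1)$, the interval $[\bar{X}_n-z_{\alpha/2}\sigma/\sqrt{n},\ \bar{X}_n+z_{\alpha/2}\sigma/\sqrt{n}]$ is the confidence interval for $\mu$ at level $1-\alpha$ because $\Pr(|\sqrt{n}(\bar{X}_n-\mu)/\sigma|\le z_{\alpha/2})=1-\alpha$.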
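The known-variance normal interval of Example 4.1 can be computed directly with the Python standard library. The sketch below assumes the usual two-sided interval, sample mean plus or minus $z_{\alpha/2}\sigma/\sqrt{n}$; the data values and the function name `normal_mean_ci` are made up for illustration.

```python
import math
from statistics import NormalDist, fmean


def normal_mean_ci(sample, sigma, alpha=0.05):
    """Two-sided confidence interval for the mean, variance known."""
    n = len(sample)
    mean = fmean(sample)
    # Upper alpha/2 quantile of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * sigma / math.sqrt(n)
    return mean - half_width, mean + half_width


sample = [4.8, 5.1, 5.3, 4.9, 5.4, 5.0, 4.7, 5.2]  # illustrative data
lo, hi = normal_mean_ci(sample, sigma=0.25, alpha=0.05)
print(lo, hi)
```

The interval narrows at the rate $1/\sqrt{n}$, so halving its width requires four times as many samples.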

Fortunately, the quantile of the distribution in Example 4.1 is numerically computable from the property of the standard normal distribution. In the case of a general statistic , its distribution of may be unknown or the quantile may be hard to compute. For such a case, the bootstrap method is a powerful tool to estimate the quantile.

Let be a statistic of . For any , we define a probability distribution on by where is the Dirac delta measure at . The random probability measure is called the empirical distribution of . Let be a map which transforms to an -valued random element. If for any , is called a bootstrap sample from and denoted by . Even after we fix , is still an -valued random element, that is, for . Here, we define a function by .
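Drawing a bootstrap sample from the empirical distribution amounts to resampling the observed data with replacement. The following sketch estimates a quantile of the deviation of a statistic by this resampling. The choice of the absolute deviation, the number of replicates, and the helper names are illustrative assumptions rather than the construction used later in the paper.

```python
import math
import random
from statistics import fmean


def bootstrap_sample(data, rng):
    """Draw len(data) points i.i.d. from the empirical distribution."""
    return [rng.choice(data) for _ in data]


def bootstrap_quantile(data, statistic, level, n_boot=1000, seed=0):
    """Estimate the `level`-quantile of |T(bootstrap sample) - T(data)|."""
    rng = random.Random(seed)
    center = statistic(data)
    deviations = sorted(
        abs(statistic(bootstrap_sample(data, rng)) - center)
        for _ in range(n_boot)
    )
    # Index of the empirical quantile, clamped to the valid range.
    index = min(n_boot - 1, max(0, math.ceil(level * n_boot) - 1))
    return deviations[index]


data = [4.8, 5.1, 5.3, 4.9, 5.4, 5.0, 4.7, 5.2]  # illustrative data
q95 = bootstrap_quantile(data, fmean, level=0.95)
print(fmean(data) - q95, fmean(data) + q95)
```

The resulting interval plays the role of the normal-theory interval when the distribution of the statistic is unknown or its quantile is hard to compute.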

In order to define the convergence in distribution for , we define the conditional convergence in distribution. Let be a sequence of bootstrap samples and be an -valued random element. Recall that the convergence in distribution for some is defined by for all -Lipschitz function . Thus, in a similar way, we define that a sequence converges in distribution to conditionally on if a function which is defined by

where is the set of all -Lipschitz functions , satisfies