# The density of expected persistence diagrams and its kernel based estimation

Persistence diagrams play a fundamental role in Topological Data Analysis, where they are used as topological descriptors of filtrations built on top of data. They consist of discrete multisets of points in the plane $\mathbb{R}^2$ that can equivalently be seen as discrete measures on $\mathbb{R}^2$. When the data come as a random point cloud, these discrete measures become random measures whose expectation is studied in this paper. First, we show that for a wide class of filtrations, including the Čech and Rips-Vietoris filtrations, the expected persistence diagram, which is a deterministic measure on $\mathbb{R}^2$, has a density with respect to the Lebesgue measure. Second, building on this result, we show that the persistence surface recently introduced in [Adams et al., Persistence images: a stable vector representation of persistent homology] can be seen as a kernel estimator of this density. We propose a cross-validation scheme for selecting an optimal bandwidth, which is proven to be a consistent procedure to estimate the density.

## 1 Introduction

Persistent homology [16], a popular approach in Topological Data Analysis (TDA), provides efficient mathematical and algorithmic tools to understand the topology of a point cloud by tracking the evolution of its homology at different scales. Specifically, given a scale (or time) parameter $r$ and a point cloud $x = (x_1, \dots, x_n)$ of size $n$, a simplicial complex $K(x, r)$ is built on $\{1, \dots, n\}$ thanks to some procedure, such as the nerve of the union of balls of radius $r$ centered on the point cloud or the Vietoris-Rips complex. Letting the scale $r$ increase gives rise to an increasing sequence of simplicial complexes $K(x) = (K(x, r))_r$ called a filtration. When a simplex is added in the filtration at a time $r$, it either "creates" or "fills" some hole in the complex. Persistent homology keeps track of the birth and death of these holes and encodes them as a persistence diagram that can be seen as a relevant and stable [6, 7] multi-scale topological descriptor of the data. A persistence diagram $D_s[K(x)]$ is thus a collection of pairs of numbers, each of those pairs corresponding to the birth time and the death time of an $s$-dimensional hole. A precise definition of persistence diagrams can be found, for example, in [16, 8]. Mathematically, a diagram is a multiset of points in

$$\Delta = \{r = (r_1, r_2),\ 0 \le r_1 < r_2 \le \infty\}. \tag{1}$$

Note that in a general setting, points in diagrams can be "at infinity" on the line $\{r_2 = \infty\}$ (e.g. a hole may never disappear). However, in the cases considered in this paper, this will be the case for a single point for $0$-dimensional homology, and this point will simply be discarded in the following.

In statistical settings, one is often given an i.i.d. sample of random point clouds $X_1, \dots, X_N$ and filtrations $K(X_1), \dots, K(X_N)$ built on top of them. We consider the set of persistence diagrams $D_s[K(X_1)], \dots, D_s[K(X_N)]$, which are thought to contain relevant topological information about the geometry of the underlying phenomenon generating the point clouds. The space of persistence diagrams is naturally endowed with the so-called bottleneck distance [12] or some variants. However, the resulting metric space turns out to be highly non-linear, making the statistical analysis of distributions of persistence diagrams rather awkward, despite several interesting results such as [28, 15, 10]. A common scheme to overcome this difficulty is to create easier-to-handle statistics by mapping the diagrams to a vector space thanks to a feature map $\Psi$, also called a representation (see, e.g., [1, 2, 4, 9, 11, 20, 25]). A classical idea to get information about the typical shape of a random point cloud is then to estimate the expectation of the distribution of representations using the mean representation

$$\overline{\Psi}_N := \frac{\sum_{i=1}^N \Psi(D_s[K(X_i)])}{N}. \tag{2}$$

In this direction, [4] introduces a representation called the persistence landscape, and shows that it satisfies a law of large numbers and a central limit theorem. Similar theorems can be shown for a wide variety of representations: it is known that $\overline{\Psi}_N$ is a consistent estimator of $E[\Psi(D_s[K(X)])]$. Although it may be useful for a classification task, this mean representation is still somewhat disappointing from a theoretical point of view. Indeed, what exactly $E[\Psi(D_s[K(X)])]$ is has been scarcely studied in a non-asymptotic setting, i.e. when the cardinality of the random point cloud is fixed or bounded.

Asymptotic results, when the size of the considered point clouds goes to infinity, are well understood for some non-persistent descriptors of the data, such as the Betti numbers: a natural question in geometric probability is to study the asymptotics of the $s$-dimensional Betti numbers of a complex built on a point cloud of size $n$, under different asymptotic regimes for the scale parameter. Notable results on the topic include [17, 30, 31]. Considerably fewer results are known about the asymptotic properties of fundamentally persistent descriptors of the data: [3] finds the right order of magnitude of maximally persistent cycles and [14] shows the convergence of persistence diagrams on stationary processes in a weak sense.

#### Contributions of the paper.

In this paper, representing persistence diagrams as discrete measures, i.e. as elements of the space of measures on $\Delta$, we establish non-asymptotic global properties of various representations and persistence-based descriptors. A multiset of points is naturally in bijection with the discrete measure on $\Delta$ created by putting a Dirac measure on each point of the multiset, with mass equal to the multiplicity of the point. In this paper a persistence diagram is thus represented as a discrete measure on $\Delta$ and, with a slight abuse of notation, we will write

$$D_s = \sum_{r \in D_s} \delta_r, \tag{3}$$

where $\delta_r$ denotes the Dirac measure at $r$ and where, as mentioned above, points with infinite persistence are simply discarded. A wide class of representations, including the persistence surface [1] (variants of this object have also been introduced in [11, 20, 25]), the accumulated persistence function [2] and the persistence silhouette [9], are conveniently expressed as $D_s(f) = \sum_{r \in D_s} f(r)$ for some function $f$ on $\Delta$. Given a random set of points $X$, the expected behavior of the representations $E[D_s[K(X)](f)]$ is well understood if the expectation $E[D_s[K(X)]]$ of the distribution of persistence diagrams is understood, where the expectation $E[\mu]$ of a random discrete measure $\mu$ is defined by the equation $E[\mu](B) = E[\mu(B)]$ for all Borel sets $B$ (see [23] for a precise definition in a more general setting). Our main contribution (Theorem 3) consists in showing that for a large class of situations the expected persistence diagram $E[D_s[K(X)]]$, which is a measure on $\Delta$, has a density $p$ with respect to the Lebesgue measure on $\mathbb{R}^2$. Therefore, $E[D_s[K(X)](f)]$ is equal to $\int p f$, and if properties of the density $p$ are shown (such as smoothness), those properties will also apply to the expectation of the representation.
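The measure point of view can be made concrete with a small sketch. The helper names below (`diagram_measure`, `integrate`, `mean_representation`) are hypothetical and not taken from the paper; they only illustrate how a multiset of (birth, death) pairs becomes atoms with integer masses, how a linear representation $D(f) = \sum_r f(r)$ is evaluated, and how the empirical mean of such representations is formed.

```python
import numpy as np

def diagram_measure(points):
    """Collapse a multiset of (birth, death) pairs into atoms and integer masses."""
    pts = np.asarray(points, dtype=float).reshape(-1, 2)
    atoms, masses = np.unique(pts, axis=0, return_counts=True)
    return atoms, masses

def integrate(atoms, masses, f):
    """Evaluate D(f): the sum over atoms r of mass(r) * f(r)."""
    return float(sum(m * f(r) for r, m in zip(atoms, masses)))

def mean_representation(diagrams, f):
    """Empirical mean of D_i(f) over a sample of diagrams."""
    values = [integrate(*diagram_measure(d), f) for d in diagrams]
    return sum(values) / len(values)

# Example: total persistence, f(r) = r2 - r1.
persistence = lambda r: r[1] - r[0]
atoms, masses = diagram_measure([(0.0, 1.0), (0.0, 1.0), (0.5, 2.0)])
total = integrate(atoms, masses, persistence)
```

Because $D \mapsto D(f)$ is linear in the measure $D$, the expectation of such a statistic is determined by the expected measure, which is the reason the expected diagram is the central object here.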

The main argument of the proof of Theorem 3 relies on the basic observation that for point clouds $x$ of given size $n$, the filtration $K(x)$ can induce only a finite number of ordering configurations of the simplices. The core of the proof consists in showing that, under suitable assumptions, this ordering is locally constant for almost all $x$. As one needs to use geometric arguments, having properties only satisfied almost everywhere is not sufficient for our purpose. One needs to show that properties hold in a stronger sense, namely that the set on which they are satisfied is a dense open set. A convenient framework to obtain such properties is given by subanalytic geometry [26]. Subanalytic sets are a class of subsets of $\mathbb{R}^d$ that are locally defined as linear projections of sets defined by analytic equations and inequalities. As most filtrations considered in Topological Data Analysis result from real algebraic constructions, such sets naturally appear in practice. On open sets where the combinatorial structure of the filtration is constant, the way the points in the diagrams are matched to pairs of simplices is fixed: only the times/scales at which those simplices appear change. Under an assumption of smoothness of those times, and using the coarea formula [24], a classical result of geometric measure theory generalizing the change of variables formula in integrals, one then deduces the existence of a density for $E[D_s[K(X)]]$.

Among the different representations of the form $D_s(f)$, the persistence surface is of particular interest. It is defined as the convolution of a diagram with a Gaussian kernel. Hence, the mean persistence surface can be seen as a kernel density estimator of the density $p$ of Theorem 3. As a consequence, the general theory of kernel density estimation applies and gives theoretical guarantees about various statistical procedures. As an illustration, we consider the bandwidth selection problem for persistence surfaces. Whereas Adams et al. [1] state that any reasonable bandwidth is sufficient for a classification task, we give arguments for the opposite when no "obvious" shapes appear in the diagrams. We then propose a cross-validation scheme to select the bandwidth matrix. The consistency of the procedure is shown using Stone’s theorem [27]. This procedure is implemented on a set of toy examples illustrating its relevance.

The paper is organized as follows: Section 2 is dedicated to the necessary background in geometric measure theory and subanalytic geometry. Results are stated in Section 3, and the main theorem is proved in Section 4. It is shown in Section 5 that the main result applies to the Čech and Rips-Vietoris filtrations. Section 6 is dedicated to the statistical study of the persistence surface, and numerical illustrations are found in Section 7. All the technical proofs that are not essential to the understanding of the ideas and results of the paper have been moved to the Appendix.

## 2 Preliminaries

### 2.1 The coarea formula

The proof of the existence of the density of the expected persistence diagram depends heavily on a classical result in geometric measure theory, the so-called coarea formula (see [24] for a gentle introduction to the subject). It is a more general version of the change of variables formula in integrals. Let $(M, \rho)$ be a metric space. The diameter of a set $A \subset M$ is defined by $\mathrm{diam}(A) := \sup_{x, y \in A} \rho(x, y)$.

Let $k$ be a non-negative integer. For $A \subset M$ and $\delta > 0$, consider

$$\mathcal{H}_k^\delta(A) := \inf\Big\{\sum_i \mathrm{diam}(U_i)^k,\ A \subset \bigcup_i U_i \text{ and } \mathrm{diam}(U_i) < \delta\Big\}. \tag{4}$$

The $k$-dimensional Hausdorff measure on $M$ of $A$ is defined by $\mathcal{H}_k(A) := \lim_{\delta \to 0} \mathcal{H}_k^\delta(A)$. If $M$ is a $d$-dimensional submanifold of $\mathbb{R}^D$, the $d$-dimensional Hausdorff measure coincides with the volume form associated to the ambient metric restricted to $M$. For instance, if $M$ is an open set of $\mathbb{R}^D$, the Hausdorff measure $\mathcal{H}_D$ is the $D$-dimensional Lebesgue measure.

[Coarea formula [24]] Let $M$ (resp. $N$) be a smooth Riemannian manifold of dimension $m$ (resp. $n$). Assume that $m \ge n$ and let $\Phi : M \to N$ be a differentiable map. Denote by $D\Phi$ the differential of $\Phi$. The Jacobian of $\Phi$ is defined by $J\Phi = \sqrt{\det(D\Phi \times D\Phi^t)}$. For $f : M \to \mathbb{R}$ a positive measurable function, the following equality holds:

$$\int_M f(x)\, J\Phi(x)\, d\mathcal{H}_m(x) = \int_N \Big( \int_{x \in \Phi^{-1}(\{y\})} f(x)\, d\mathcal{H}_{m-n}(x) \Big) d\mathcal{H}_n(y). \tag{5}$$

In particular, if $J\Phi > 0$ almost everywhere, one can apply the coarea formula to $f \times (J\Phi)^{-1}$ to compute $\int_M f\, d\mathcal{H}_m$. Having $J\Phi > 0$ is equivalent to having $D\Phi$ of full rank: most of the proof of our main theorem consists in showing that this property holds for certain functions of interest.
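As a sanity check of the coarea formula, here is a worked instance under assumptions that are ours rather than the paper's: $M = \mathbb{R}^2 \setminus \{0\}$, $N = (0, \infty)$, $\Phi(x) = \|x\|$, and the usual normalization $J\Phi = \sqrt{\det(D\Phi\, D\Phi^t)}$. In that case the formula reduces to integration in polar coordinates.

```latex
% Worked example (illustrative assumptions): Phi(x) = ||x|| on R^2 \ {0}.
\begin{align*}
D\Phi(x) &= \frac{x^t}{\|x\|}, \qquad
J\Phi(x) = \sqrt{\det\big(D\Phi(x)\,D\Phi(x)^t\big)} = 1, \\
\int_{\mathbb{R}^2} f(x)\,\mathrm{d}x
 &= \int_0^\infty \Big( \int_{\{\|x\| = \rho\}} f(x)\,\mathrm{d}\mathcal{H}_1(x) \Big)\,\mathrm{d}\rho . \\
\intertext{Taking $f = \mathbf{1}\{\|x\| \le 1\}$, the inner integral is the circle length $2\pi\rho$ for $\rho \le 1$ and $0$ otherwise:}
\pi &= \int_0^1 2\pi\rho\,\mathrm{d}\rho .
\end{align*}
```

The same mechanism is what turns an integral over point configurations into an integral over birth-death pairs later in the paper, with fibers of dimension $nd - 2$ instead of circles.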

### 2.2 Background on subanalytic sets

We now give basic results on subanalytic geometry, whose proofs are given in the Appendix. See [26] for a thorough review of the subject. Let $M$ be a connected real analytic submanifold, possibly with boundary, whose dimension is denoted by $d$.

A subset $X$ of $M$ is semianalytic if each point of $M$ has a neighbourhood $U \subset M$ such that $X \cap U$ is of the form

$$\bigcup_{i=1}^{p} \bigcap_{j=1}^{q} X_{ij}, \tag{6}$$

where $X_{ij}$ is either $f_{ij}^{-1}(\{0\})$ or $f_{ij}^{-1}((0, \infty))$ for some analytic functions $f_{ij} : U \to \mathbb{R}$.

A subset $X$ of $M$ is subanalytic if for each point of $M$, there exists a neighbourhood $U$ of this point, a real analytic manifold $N$ and $A$, a relatively compact semianalytic set of $N \times M$, such that $X \cap U$ is the projection of $A$ on $M$. A function $f : X \to \mathbb{R}$ is subanalytic if its graph is subanalytic in $M \times \mathbb{R}$. The set of real-valued subanalytic functions on $X$ is denoted by $\mathcal{S}(X)$.

A point $x$ in a subanalytic subset $X$ of $M$ is smooth (of dimension $k$) if, in some neighbourhood of $x$ in $M$, $X$ is an analytic submanifold (of dimension $k$). The maximal dimension of a smooth point of $X$ is called the dimension of $X$. The smooth points of $X$ of dimension $d$ are called regular, and the other points are called singular. The set $\mathrm{Reg}(X)$ of regular points of $X$ is an open subset of $M$, possibly empty; the set of singular points is denoted by $\mathrm{Sing}(X)$.

###### Lemma .

(i) For a real-valued subanalytic function $f$ on $M$, the set on which $f$ is analytic is an open subanalytic set of $M$. Its complement is a subanalytic set of dimension smaller than $d$.

Fix a subanalytic subset $X$ of $M$. Assume that $f, g : X \to \mathbb{R}$ are subanalytic functions such that the image of a bounded set is bounded. Then,

(ii) The functions $fg$ and $f + g$ are subanalytic.

(iii) The sets $f^{-1}(\{0\})$ and $f^{-1}((0, \infty))$ are subanalytic in $M$.

As a consequence of point (i), for a real-valued subanalytic function $f$ on $M$, one can define its gradient $\nabla f$ everywhere but on some subanalytic set of dimension smaller than $d$.

###### Lemma .

Let $X$ be a subanalytic subset of $M$. If the dimension of $X$ is smaller than $d$, then $\mathcal{H}_d(X) = 0$.

As a direct corollary, we always have

$$\mathcal{H}_d(X) = \mathcal{H}_d(\mathrm{Reg}(X)). \tag{7}$$

Write $\mathcal{N}(M)$ for the class of subanalytic subsets $X$ of $M$ with $\mathrm{Reg}(X) = \emptyset$. We have just shown that $\mathcal{H}_d \equiv 0$ on $\mathcal{N}(M)$. They form a special class of negligible sets. We say that a property is verified almost subanalytically everywhere (a.s.e.) if the set on which it is not verified is included in a set of $\mathcal{N}(M)$. For example, Lemma 2.2 implies that $\nabla f$ is defined a.s.e.

## 3 The density of expected persistence diagrams

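Let $n > 0$ be an integer. Write $\mathcal{F}_n$ for the collection of non-empty subsets of $\{1, \dots, n\}$. Let $\varphi = (\varphi[J])_{J \in \mathcal{F}_n} : M^n \to \mathbb{R}^{\mathcal{F}_n}$ be a continuous function. The function $\varphi$ will be used to construct the persistence diagram and is called a filtering function: a simplex $J$ is added in the filtration at the time $\varphi[J]$. Write, for $x = (x_1, \dots, x_n) \in M^n$ and for $J$ a simplex, $x(J) := (x_j)_{j \in J}$. We make the following assumptions on $\varphi$:

1. Absence of interaction: For $J \in \mathcal{F}_n$, $\varphi[J](x)$ only depends on $x(J)$.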

2. Invariance by permutation: For and for , if is a permutation of , then .

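3. Monotony: For $J \subset J' \in \mathcal{F}_n$, $\varphi[J] \le \varphi[J']$.

4. Compatibility: For a simplex $J \in \mathcal{F}_n$ and for $j \in J$, if $\varphi[J](x_1, \dots, x_n)$ is not a function of $x_j$ on some open set $U$ of $M^n$, then $\varphi[J] \equiv \varphi[J \setminus \{j\}]$ on $U$.

5. Smoothness: The function $\varphi$ is subanalytic and the gradient of each of its entries (which is defined a.s.e.) is non-vanishing a.s.e.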

Assumptions (K2) and (K3) ensure that a filtration $K(x)$ can be defined thanks to $\varphi$ by:

$$\forall J \in \mathcal{F}_n,\quad J \in K(x, r) \iff \varphi[J](x) \le r. \tag{8}$$

Assumption (K1) means that the moment a simplex is added in the filtration only depends on the position of its vertices, but not on their relative position in the point cloud. For $J \in \mathcal{F}_n$, the gradient of $\varphi[J]$ is a vector field in $TM^n$. Its projection on the $j$th coordinate is denoted by $\nabla^j \varphi[J]$: it is a vector field in $TM$ defined a.s.e. The persistence diagram of the filtration $K(x)$ for $s$-dimensional homology is denoted by $D_s[K(x)]$.

###### Theorem .

Fix $n \ge 1$. Assume that $M$ is a real analytic compact $d$-dimensional connected submanifold, possibly with boundary, and that $X$ is a random variable on $M^n$ having a density with respect to the Hausdorff measure $\mathcal{H}_{nd}$. Assume that the filtration $K$ satisfies the assumptions (K1)-(K5). Then, for $s \ge 0$, the expected measure $E[D_s[K(X)]]$ has a density with respect to the Lebesgue measure on $\Delta$.

The condition that $M$ is compact can be relaxed in most cases: it is only used to ensure that the subanalytic functions appearing in the proof satisfy the boundedness condition of Lemma 2.2. For the Čech and Rips-Vietoris filtrations, one can directly verify that the function $\varphi$ (and therefore the functions appearing in the proofs) satisfies it when $M = \mathbb{R}^d$. Indeed, in this case, the filtering functions are semi-algebraic.

Classical filtrations such as the Rips-Vietoris and Čech filtrations do not satisfy the full set of assumptions (K1)-(K5). Specifically, they do not satisfy the second part of assumption (K5): all singletons are included at time $0$ in those filtrations, so that $\varphi[\{j\}] \equiv 0$ and the corresponding gradient is null everywhere. This leads to a well-known phenomenon for Rips-Vietoris and Čech diagrams: all the non-infinite points of the diagram for $0$-dimensional homology are included in the vertical line $\{0\} \times [0, \infty)$. A theorem similar to Theorem 3 still holds in this case:

###### Theorem .

Fix $n \ge 1$. Assume that $M$ is a real analytic compact $d$-dimensional connected submanifold and that $X$ is a random variable on $M^n$ having a density with respect to the Hausdorff measure $\mathcal{H}_{nd}$. Define assumption (K5’):

(K5’) The function $\varphi$ is subanalytic and the gradient of its entries $\varphi[J]$ for $J$ of size greater than 1 is non-vanishing a.s.e. Moreover, for $\{j\}$ a singleton, $\varphi[\{j\}] \equiv 0$.

Assume that $K$ satisfies the assumptions (K1)-(K4) and (K5’). Then, for $s \ge 1$, $E[D_s[K(X)]]$ has a density with respect to the Lebesgue measure on $\Delta$. Moreover, $E[D_0[K(X)]]$ has a density with respect to the Lebesgue measure on the vertical line $\{0\} \times [0, \infty)$.

The proof of this theorem is very similar to the proof of Theorem 3. It is therefore relegated to the appendix.
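To make the statement about $0$-dimensional homology concrete, here is a small worked case. The setting is our own illustrative choice, not one treated in the paper: $n = 2$ points drawn independently and uniformly on $M = [0, 1]$, with the Rips-Vietoris convention that an edge $\{1, 2\}$ enters at time $|x_1 - x_2|$. After discarding the point at infinity, the diagram has a single point $(0, |x_1 - x_2|)$, and the density on the vertical line can be computed both directly and through the coarea formula.

```latex
% Illustrative example: X = (X_1, X_2) uniform on [0,1]^2, Rips filtration.
\begin{align*}
D_0[K(x)] &= \delta_{(0,\,|x_1 - x_2|)}, \qquad
\Phi(x) := |x_1 - x_2|, \qquad J\Phi(x) = \|\nabla \Phi(x)\| = \sqrt{2}, \\
\Phi^{-1}(\{t\}) &= \{x_1 - x_2 = t\} \cup \{x_2 - x_1 = t\}, \qquad
\mathcal{H}_1\big(\Phi^{-1}(\{t\})\big) = 2\sqrt{2}\,(1 - t) \quad (0 < t < 1), \\
p(t) &= \int_{\Phi^{-1}(\{t\})} (J\Phi(x))^{-1}\,\kappa(x)\,\mathrm{d}\mathcal{H}_1(x)
      = \frac{1}{\sqrt{2}} \cdot 2\sqrt{2}\,(1 - t) = 2(1 - t).
\end{align*}
```

Here $\kappa \equiv 1$ is the density of $X$, and the result $2(1 - t)$ on $[0, 1]$ agrees with the classical density of the distance between two independent uniform points.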

One can easily generalize Theorem 3 and assume that the size of the point process is itself random. For $n \in \mathbb{N}$, define a function $\varphi_n$ on $M^n$ satisfying the assumptions (K1)-(K5). If $x$ is a finite subset of $M$, define $K(x)$ as the filtration associated to $\varphi_n$, where $n$ is the size of $x$. We obtain the following corollary, proven in the appendix.

###### Corollary .

Assume that has some density with respect to the law of a Poisson process on of intensity , such that . Assume that satisfies the assumptions (K1)-(K5). Then, for , has a density with respect to the Lebesgue measure on .

The condition ensures the existence of the expected diagram and is for example satisfied when is a Poisson process with finite intensity.

As the way the filtration is created is smooth, one may actually wonder whether the density of $E[D_s[K(X)]]$ is smooth as well: it is the case as long as the way the points are sampled is smooth. Recalling that a function is said to be of class $C^k$ if it is $k$ times differentiable, with a continuous $k$th derivative, we have the following result.

###### Theorem .

Fix $n \ge 1$ and assume that $X \in M^n$ has some density of class $C^k$ with respect to $\mathcal{H}_{nd}$. Then, for $s \ge 0$, the density of $E[D_s[K(X)]]$ is of class $C^k$.

The proof is based on classical results of continuity under the integral sign as well as a use of the implicit function theorem: it can be found in the appendix.

As a corollary of Theorem 3, we obtain the smoothness of various expected descriptors computed on persistence diagrams. For instance, the expected birth distribution and the expected death distribution have smooth densities under the same hypothesis, as they are obtained by projection of the expected diagram on some axis. Another example is the smoothness of the expected Betti curves. The $s$th Betti number of a filtration $K(x)$ at scale $r$ is defined as the dimension of the $s$th homology group of $K(x, r)$. The Betti curves, which map $r$ to this Betti number, are step functions which can be used as statistics, as in [29] where they are used for a classification task on time series. With little additional work (see proof in Appendix), the expected Betti curves are shown to be smooth.

###### Corollary .

Under the same hypothesis as Theorem 3, for $s \ge 0$, the expected Betti curve is a $C^k$ function.
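A minimal sketch of how a Betti curve can be read off a diagram may help here. It assumes the convention that a class with birth $b$ and death $d$ is alive at scale $r$ when $b \le r < d$, and it ignores points at infinity (which the paper discards). The function names are hypothetical.

```python
import numpy as np

def betti_curve(diagram, grid):
    """Count, for each scale r in grid, the diagram points with birth <= r < death."""
    d = np.asarray(diagram, dtype=float).reshape(-1, 2)
    grid = np.asarray(grid, dtype=float)
    births = d[:, 0][None, :]
    deaths = d[:, 1][None, :]
    alive = (births <= grid[:, None]) & (grid[:, None] < deaths)
    return alive.sum(axis=1)

def mean_betti_curve(diagrams, grid):
    """Empirical counterpart of the expected Betti curve over a sample of diagrams."""
    return np.mean([betti_curve(d, grid) for d in diagrams], axis=0)

curve = betti_curve([(0.0, 1.0), (0.5, 2.0)], [0.0, 0.5, 1.0, 2.0])
```

Each individual curve is a step function; averaging over many samples is what produces the smooth expected curve that the corollary describes.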

## 4 Proof of Theorem 3

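First, one can always replace $M^n$ by the set on which $\varphi$ is analytic, as Lemma 2.2 implies that it is an open set whose complement is a negligible subanalytic set. We will therefore assume that $\varphi$ is analytic on $M^n$.

Given $x \in M^n$, the different values taken by $\varphi(x)$ on the filtration can be written $r_1 < \dots < r_L$. Define $E_l(x)$ as the set of simplices $J$ such that $\varphi[J](x) = r_l$. The sets $E_1(x), \dots, E_L(x)$ form a partition of $\mathcal{F}_n$ denoted by $\mathcal{A}(x)$.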

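###### Lemma .

For a.s.e. $x \in M^n$, for $l = 1, \dots, L$, the set $E_l(x)$ of simplices on which $\varphi(x)$ takes its $l$th value has a minimal element $J_l$ (for the partial order induced by inclusion).

###### Proof.

Fix $J, J' \subset \{1, \dots, n\}$ with $J \ne J'$ and $J \cap J' \ne \emptyset$. Consider the subanalytic functions $f : x \mapsto \varphi[J](x) - \varphi[J'](x)$ and $g : x \mapsto \varphi[J](x) - \varphi[J \cap J'](x)$. The set

$$C(J, J') := \{f = 0\} \cap \{g > 0\} \tag{9}$$

is a subanalytic subset of $M^n$. Assume that it contains some open set $U$. On $U$, $\varphi[J]$ is equal to $\varphi[J']$. Therefore, it does not depend on the entries $x_j$ for $j \in J \setminus J'$. Hence, by assumption (K4), $\varphi[J]$ is actually equal to $\varphi[J \cap J']$ on $U$. This is a contradiction with having $g > 0$ on $U$. Therefore, $C(J, J')$ does not contain any open set, and all its points are singular: $C(J, J')$ is a negligible subanalytic set. If $J \cap J' = \emptyset$, similar arguments show that $C(J, J') := \{f = 0\}$ cannot contain any open set: it would contradict assumption (K5). On the complement of

$$C := \bigcup_{J \ne J' \subset \{1, \dots, n\}} C(J, J'), \tag{10}$$

having $\varphi[J](x) = \varphi[J'](x)$ implies that this quantity is equal to $\varphi[J \cap J'](x)$. This shows the existence of a minimal element of $E_l(x)$ on the complement of $C$. This property is therefore a.s.e. satisfied. ∎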

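###### Lemma .

A.s.e., the partition of $\mathcal{F}_n$ into the sets $E_l(x)$ is locally constant in $x$.

###### Proof.

Fix a partition $\{E_1, \dots, E_L\}$ of $\mathcal{F}_n$ induced by some filtration, with minimal elements $J_1, \dots, J_L$. Consider the subanalytic functions $F, G$ defined, for $x \in M^n$, by

$$F(x) = \sum_{l=1}^{L} \sum_{J \in E_l} \big(\varphi[J](x) - \varphi[J_l](x)\big) \quad \text{and} \quad G(x) = \sum_{l \ne l'} \big(\varphi[J_l](x) - \varphi[J_{l'}](x)\big)^2.$$

The set of points $x$ inducing this partition is exactly the set $\{F = 0\} \cap \{G > 0\}$, which is subanalytic. These sets, taken over all partitions of $\mathcal{F}_n$, define a finite partition of the space $M^n$. On the open set of regular points of each of them, the partition induced by $x$ is constant. Therefore, it is locally constant everywhere but on the union of their singular points, which is a negligible subanalytic set. ∎

Therefore, the space $M^n$ is partitioned into a negligible subanalytic set and some open subanalytic sets $U_1, \dots, U_R$ on which the partition induced by $x$ is constant.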

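###### Lemma .

Fix $1 \le r \le R$ and assume that $J_1, \dots, J_L$ are the minimal elements of the partition on $U_r$. Then, for $1 \le l \le L$ and $j \in J_l$, $\nabla^j \varphi[J_l] \ne 0$ a.s.e. on $U_r$.

###### Proof.

By minimality of $J_l$, for $j \in J_l$, the subanalytic set $\{\nabla^j \varphi[J_l] = 0\} \cap U_r$ cannot contain an open set. It is therefore negligible. ∎

Fix $1 \le r \le R$ and write

$$V_r = U_r \setminus \Big( \bigcup_{l=1}^{L} \bigcup_{j=1}^{|J_l|} \{\nabla^j \varphi[J_l] = 0\} \Big).$$

The complement of $V_r$ in $U_r$ is still negligible. For $x \in V_r$, $D_s[K(x)]$ is written $\sum_{i=1}^{N_r} \delta_{r_i}$, where $r_i = (\varphi[J_{l_1}](x), \varphi[J_{l_2}](x))$. The integer $N_r$ and the simplices $J_{l_1}$, $J_{l_2}$ depend only on $V_r$. Note that the death time $\varphi[J_{l_2}](x)$ is always greater than the birth time $\varphi[J_{l_1}](x)$, so that $J_{l_2}$ cannot be included in $J_{l_1}$. The map $x \mapsto r_i$ has its differential of rank 2. Indeed, take $j \in J_{l_2} \setminus J_{l_1}$. By Lemma 4, $\nabla^j \varphi[J_{l_2}](x) \ne 0$. Also, as $\varphi[J_{l_1}]$ only depends on the entries of $x$ indexed by $J_{l_1}$ (assumption (K1)), $\nabla^j \varphi[J_{l_1}](x) = 0$. Furthermore, take $j'$ in $J_{l_1}$. By Lemma 4, $\nabla^{j'} \varphi[J_{l_1}](x) \ne 0$. This implies that the differential is of rank 2.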

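We now compute the $s$th persistence diagram for $s \ge 0$. Write $\kappa$ for the density of $X$ with respect to the measure $\mathcal{H}_{nd}$ on $M^n$. Then,

$$\begin{aligned} E[D_s[K(X)]] &= \sum_{r=1}^{R} E\big[\mathbf{1}\{X \in V_r\}\, D_s[K(X)]\big] = \sum_{r=1}^{R} E\Big[\mathbf{1}\{X \in V_r\} \sum_{i=1}^{N_r} \delta_{r_i}\Big] \\ &= \sum_{r=1}^{R} \sum_{i=1}^{N_r} E\big[\mathbf{1}\{X \in V_r\}\, \delta_{r_i}\big]. \end{aligned}$$

Write $\mu_{ir}$ for the measure $E[\mathbf{1}\{X \in V_r\}\, \delta_{r_i}]$. To conclude, it suffices to show that this measure has a density with respect to the Lebesgue measure on $\Delta$. This is a consequence of the coarea formula. Define the function $\Phi_{ir} : x \in V_r \mapsto r_i$. We have already seen that $\Phi_{ir}$ is of rank $2$ on $V_r$, so that $J\Phi_{ir} > 0$. By the coarea formula (see Lemma 2.1), for a Borel set $B$ in $\Delta$,

$$\begin{aligned} \mu_{ir}(B) = P(\Phi_{ir}(X) \in B,\ X \in V_r) &= \int_{V_r} \mathbf{1}\{\Phi_{ir}(x) \in B\}\, \kappa(x)\, d\mathcal{H}_{nd}(x) \\ &= \int_{u \in B} \int_{x \in \Phi_{ir}^{-1}(\{u\})} (J\Phi_{ir}(x))^{-1} \kappa(x)\, d\mathcal{H}_{nd-2}(x)\, du. \end{aligned}$$

Therefore, $\mu_{ir}$ has a density with respect to the Lebesgue measure on $\Delta$ equal to

$$p_{ir}(u) = \int_{x \in \Phi_{ir}^{-1}(\{u\})} (J\Phi_{ir}(x))^{-1} \kappa(x)\, d\mathcal{H}_{nd-2}(x). \tag{11}$$

Finally, $E[D_s[K(X)]]$ has a density equal to

$$p(u) = \sum_{r=1}^{R} \sum_{i=1}^{N_r} \int_{x \in \Phi_{ir}^{-1}(\{u\})} (J\Phi_{ir}(x))^{-1} \kappa(x)\, d\mathcal{H}_{nd-2}(x). \tag{12}$$

Notice that, for $n$ fixed, the above proof, and thus the conclusion of Theorem 3, also works if the diagrams are represented by normalized discrete measures, i.e. probability measures defined by

$$D_s = \frac{1}{|D_s|} \sum_{r \in D_s} \delta_r. \tag{13}$$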

## 5 Examples

We now note that the Rips-Vietoris and the Čech filtrations satisfy the assumptions (K1)-(K4) and (K5’) when $M = \mathbb{R}^d$ is a Euclidean space. Note that similar arguments show that weighted versions of those filtrations (see [5]) satisfy assumptions (K1)-(K5).

### 5.1 Rips-Vietoris filtration

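For the Rips-Vietoris filtration, $\varphi[J](x) = \max_{i, j \in J} \|x_i - x_j\|$. The function $\varphi$ clearly satisfies (K1), (K2) and (K3). It is also subanalytic, as it is the maximum of semi-algebraic functions.

For $x \in M^n$ and $J$ a simplex of size greater than one, $\varphi[J](x) = \|x_i - x_j\|$ for some indices $i, j$. Those indices are locally stable, and $\varphi[J](x) = \varphi[\{i, j\}](x)$: hypothesis (K4) is satisfied. Furthermore, on this set,

$$\nabla \varphi[\{i, j\}](x) = \Big( \frac{x_i - x_j}{\|x_i - x_j\|},\ \frac{x_j - x_i}{\|x_i - x_j\|} \Big) \ne 0. \tag{14}$$

Hence, (K5’) is also satisfied, and the variant of Theorem 3 stated under (K5’) applies to the Rips-Vietoris filtration.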
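The Rips-Vietoris filtering function is simple enough to sketch directly. The snippet below assumes the convention that a simplex enters at the largest pairwise distance between its vertices (consistent with the gradient formula (14)) and that singletons enter at time 0; `rips_phi` and `rips_complex` are hypothetical helper names, and the brute-force enumeration is only meant for tiny examples.

```python
from itertools import combinations
import numpy as np

def rips_phi(x, J):
    """phi[J](x): largest pairwise distance among the vertices indexed by J (0 for a singleton)."""
    pts = np.asarray(x, dtype=float)[list(J)]
    if len(pts) < 2:
        return 0.0
    return max(float(np.linalg.norm(p - q)) for p, q in combinations(pts, 2))

def rips_complex(x, r, max_dim=2):
    """K(x, r): all simplices J (up to dimension max_dim) with phi[J](x) <= r."""
    n = len(x)
    return [J
            for k in range(1, max_dim + 2)
            for J in combinations(range(n), k)
            if rips_phi(x, J) <= r]

x = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
simplices_at_1 = rips_complex(x, 1.0)
```

Since the value on a simplex is the maximum over its edges, it can only grow when vertices are added, which is the monotony assumption (K3) in this particular case.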

### 5.2 Čech filtration

The ball centered at $x$ of radius $r$ is denoted by $B(x, r)$. For the Čech filtration,

$$\varphi[J](x) = \inf_{r > 0} \Big\{ \bigcap_{j \in J} B(x_j, r) \ne \emptyset \Big\}. \tag{15}$$

First, it is clear that (K1), (K2) and (K3) are satisfied by $\varphi$.

We give without proof a characterization of the Čech complex.

###### Proposition .

Let $x$ be in $M^n$ and fix $J \in \mathcal{F}_n$. If the circumcenter of $x(J)$ is in the convex hull of $x(J)$, then $\varphi[J](x)$ is the radius of the circumsphere of $x(J)$. Otherwise, its projection on the convex hull belongs to the convex hull of some subsimplex $x(J')$ of $x(J)$ and $\varphi[J](x) = \varphi[J'](x)$.

The Cayley-Menger matrix of a simplex $x = (x_1, \dots, x_k)$ is the symmetric matrix $M(x)$ of size $k + 1$, with zeros on the diagonal, such that $M(x)_{1, j} = 1$ for $j > 1$ and $M(x)_{i+1, j+1} = \|x_i - x_j\|^2$ for $i, j \le k$.

###### Proposition (see [13]).

Let $x$ be a point in general position. Then, the Cayley-Menger matrix $M(x)$ is invertible with $(M(x)^{-1})_{1, 1} = -2\rho^2$, where $\rho$ is the radius of the circumsphere of $x$. The $k$ other entries of the first line of $M(x)^{-1}$ are the barycentric coordinates of the circumcenter.

Therefore, the map which sends a simplex to its circumcenter is analytic, and the set on which the circumcenter of a simplex belongs to the interior of its convex hull is a subanalytic set. On such a set, the function $\varphi[J]$ is also analytic, as it is, up to a constant, the square root of an entry of the inverse of a matrix which is polynomial in $x$. Furthermore, on the open set on which the circumcenter is outside the convex hull, we have shown that $\varphi[J](x) = \varphi[J'](x)$ for some subsimplex $J'$: assumption (K4) is satisfied.
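The two propositions above translate into a short numerical sketch. It assumes points in general position and the standard normalization in which the top-left entry of the inverse Cayley-Menger matrix equals $-2\rho^2$; the function names are hypothetical. `cech_phi` applies the case distinction of the first proposition: if the circumcenter has non-negative barycentric coordinates, the value is the circumradius; otherwise the value is attained on a proper face, so the sketch takes the maximum over facets.

```python
import numpy as np

def cayley_menger(points):
    """Bordered matrix of squared pairwise distances (first row and column of ones, zero diagonal)."""
    p = np.asarray(points, dtype=float)
    k = len(p)
    sq = np.sum((p[:, None, :] - p[None, :, :]) ** 2, axis=-1)
    M = np.ones((k + 1, k + 1))
    M[0, 0] = 0.0
    M[1:, 1:] = sq
    return M

def circumsphere(points):
    """Return (radius, barycentric coordinates of the circumcenter, circumcenter)."""
    p = np.asarray(points, dtype=float)
    Minv = np.linalg.inv(cayley_menger(p))   # requires general position
    radius = float(np.sqrt(max(-Minv[0, 0] / 2.0, 0.0)))
    bary = Minv[0, 1:]
    return radius, bary, bary @ p

def cech_phi(points, tol=1e-12):
    """Radius of the smallest ball containing the points (brute-force recursion on facets)."""
    p = np.asarray(points, dtype=float)
    radius, bary, _ = circumsphere(p)
    if np.all(bary >= -tol):                 # circumcenter inside the convex hull
        return radius
    return max(cech_phi(np.delete(p, i, axis=0), tol) for i in range(len(p)))

radius, bary, center = circumsphere([(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)])
```

The recursion is exponential in the number of vertices and is only meant to mirror the statement of the proposition, not to be an efficient minimum enclosing ball algorithm.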

Finally, let us show that assumption (K5’) is satisfied. The previous paragraph shows the subanalyticity of $\varphi$. For $J$ a simplex of size greater than one, there exists some subsimplex $J'$ such that $\varphi[J](x)$ is the radius of the circumsphere of $x(J')$. It is clear that there cannot be an open set on which this radius is constant. Thus, $\nabla \varphi[J]$ is a.s.e. non-null.

## 6 Persistence surface as a kernel density estimator

The persistence surface is a representation of persistence diagrams introduced by Adams et al. in [1]. It consists in a convolution of a diagram with a kernel, a general idea that has been repeatedly and fruitfully exploited, with slight variations, for instance in [11, 20, 25]. For $K : \mathbb{R}^2 \to \mathbb{R}$ a kernel and $H$ a bandwidth matrix (i.e. a symmetric positive definite matrix), let, for $u \in \mathbb{R}^2$,

$$K_H(u) = \det(H)^{-1/2} K(H^{-1/2} \cdot u). \tag{16}$$

For $D$ a diagram, $K$ a kernel, $H$ a bandwidth matrix and $w : \mathbb{R}^2 \to \mathbb{R}_+$ a weight function, one defines the persistence surface of $D$ with kernel $K$ and weight function $w$ by:

$$\forall u \in \mathbb{R}^2,\quad \rho(D)(u) := \sum_{r \in D} w(r) K_H(u - r) = D(w K_H(u - \cdot)). \tag{17}$$

Assume that $X$ is some point process satisfying the assumptions of Theorem 3. Then, for $s \ge 0$, $\mu := E[D_s[K(X)]]$ has some density $p$ with respect to the Lebesgue measure on $\Delta$. Therefore, $\mu_w$, the measure having density $w$ with respect to $\mu$, has a density equal to $w \times p$ with respect to the Lebesgue measure. The mean persistence surface $E[\rho(D_s[K(X)])]$ is exactly the convolution of $w p$ by some kernel function: the persistence surface $\rho(D_s[K(X)])$ is actually a kernel density estimator of $w p$.
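For readers who want to experiment, definitions (16) and (17) can be sketched as follows, assuming the standard Gaussian kernel on $\mathbb{R}^2$ for $K$ (so that $K(H^{-1/2} v)$ only depends on $v^t H^{-1} v$). The names `gaussian_kh` and `persistence_surface` and the default constant weight are illustrative choices, not the paper's.

```python
import numpy as np

def gaussian_kh(v, H):
    """K_H(v) = det(H)^(-1/2) K(H^(-1/2) v) for the standard Gaussian K on R^2."""
    v = np.atleast_2d(np.asarray(v, dtype=float))
    H = np.asarray(H, dtype=float)
    quad = np.einsum("ni,ij,nj->n", v, np.linalg.inv(H), v)
    return np.exp(-0.5 * quad) / (2.0 * np.pi * np.sqrt(np.linalg.det(H)))

def persistence_surface(diagram, u, H, weight=lambda r: 1.0):
    """rho(D)(u): sum over diagram points r of w(r) * K_H(u - r)."""
    d = np.asarray(diagram, dtype=float).reshape(-1, 2)
    u = np.asarray(u, dtype=float)
    w = np.array([weight(r) for r in d])
    return float(np.sum(w * gaussian_kh(u - d, H)))

# Evaluate at one location with a persistence-based weight (an illustrative choice).
value = persistence_surface([(0.0, 1.0), (0.2, 0.3)], (0.0, 1.0), 0.1 * np.eye(2),
                            weight=lambda r: r[1] - r[0])
```

Averaging `persistence_surface` over a sample of diagrams at each grid location gives the mean persistence surface, which is the kernel density estimator discussed in this section.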

If a point cloud approximates a shape, then its persistence diagram (for the Čech filtration for instance) is made of numerous points with small persistence and a few meaningful points of high persistence which correspond to the persistence diagram of the "true" shape. As one is interested in the latter points, a weight function $w$, which is typically an increasing function of the persistence, is used to suppress the importance of the topological noise in the persistence surface. Adams et al. [1] argue that in this setting, the choice of the bandwidth matrix has little effect for statistical purposes (e.g. classification), a claim supported by numerical experiments on simple sets of synthetic data, e.g. torus, sphere, three clusters, etc.

However, in the setting where the datasets are more complicated and contain no obvious "real" shapes, one may expect the choice of the bandwidth parameter to become more critical: there are no highly persistent, easily distinguishable points in the diagrams anymore and the precise structure of the density functions of the processes becomes of interest. We now show that a cross validation approach allows the bandwidth selection task to be done in an asymptotically consistent way. This is a consequence of a generalization of Stone’s theorem [27] when observations are not random vectors but random measures.

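Assume that $\mu_1, \dots, \mu_N$ are i.i.d. random measures on $\mathbb{R}^2$, such that there exists a deterministic constant $C$ with $|\mu_1| \le C$, where $|\mu_1|$ denotes the total mass. Assume that the expected measure $E[\mu_1]$ has a bounded density $p$ with respect to the Lebesgue measure on $\mathbb{R}^2$. Given a kernel $K : \mathbb{R}^2 \to \mathbb{R}$ and a bandwidth matrix $H$, one defines the kernel density estimator

$$\hat{p}_H(x) := \frac{1}{N} \sum_{i=1}^{N} \int K_H(x - y)\, \mu_i(dy). \tag{18}$$

The optimal bandwidth $H_{opt}$ minimizes the Mean Integrated Square Error (MISE)

$$\mathrm{MISE}(H) := E\big[\|p - \hat{p}_H\|^2\big] = E\Big[\int (p(x) - \hat{p}_H(x))^2\, dx\Big]. \tag{19}$$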

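Of course, as $p$ is unknown, $\mathrm{MISE}(H)$ cannot be computed. Minimizing $\mathrm{MISE}(H)$ is equivalent to minimizing $J(H) := \mathrm{MISE}(H) - \|p\|^2$. Define

$$\hat{p}^{\,i}_H(x) := \frac{1}{N - 1} \sum_{j \ne i} \int K_H(x - y)\, \mu_j(dy) \tag{20}$$

and

$$\hat{J}(H) := \frac{1}{N^2} \sum_{i, j} \iint K^{(2)}_H(x - y)\, \mu_i(dx)\, \mu_j(dy) - \frac{2}{N} \sum_i \int \hat{p}^{\,i}_H(x)\, \mu_i(dx), \tag{21}$$

where $K^{(2)}_H$ denotes the convolution of $K_H$ with itself. The quantity $\hat{J}(H)$ is an unbiased estimator of $J(H)$. The selected bandwidth $\hat{H}$ is then chosen to be equal to $\arg\min_H \hat{J}(H)$.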
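The cross-validation score (21) can be computed in closed form for discrete measures. The sketch below makes several assumptions that are ours: the kernel is Gaussian (so that $K_H$ convolved with itself is $K_{2H}$), each measure is given as a pair `(atoms, weights)` of NumPy arrays, and candidate bandwidth matrices are of the form $h^2 I_2$. All function names are hypothetical.

```python
import numpy as np

def gauss(diff, H):
    """Gaussian K_H evaluated on an array of 2D differences (last axis of size 2)."""
    quad = np.einsum("...i,ij,...j->...", diff, np.linalg.inv(H), diff)
    return np.exp(-0.5 * quad) / (2.0 * np.pi * np.sqrt(np.linalg.det(H)))

def pair_integral(mu, nu, H):
    """Double integral of K_H(x - y) against the discrete measures mu(dx) and nu(dy)."""
    (a, wa), (b, wb) = mu, nu
    diff = a[:, None, :] - b[None, :, :]
    return float(wa @ gauss(diff, H) @ wb)

def cv_score(measures, H):
    """Cross-validation score of equation (21) for a Gaussian kernel."""
    H = np.asarray(H, dtype=float)
    N = len(measures)
    # First term: K_H * K_H = K_{2H} for Gaussian kernels; all pairs (i, j) are included.
    first = sum(pair_integral(m, n, 2.0 * H) for m in measures for n in measures) / N ** 2
    # Second term: leave-one-out estimator integrated against each measure (pairs i != j).
    cross = sum(pair_integral(measures[i], measures[j], H)
                for i in range(N) for j in range(N) if i != j)
    return first - 2.0 * cross / (N * (N - 1))

def select_bandwidth(measures, hs):
    """Return the h in hs minimizing the score for H = h^2 * I, together with all scores."""
    scores = [cv_score(measures, h ** 2 * np.eye(2)) for h in hs]
    return hs[int(np.argmin(scores))], scores
```

A typical call would pass weighted diagrams, for instance `measures = [(atoms_i, weights_i) for each diagram]` and `hs = np.logspace(-2, 0, 20)`; the grid bounds here are placeholders. The cost is quadratic in the total number of atoms, since every pair of atoms from different measures is visited.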

[Stone’s theorem [27]] Assume that the kernel $K$ is nonnegative, Hölder continuous and has a maximum attained at $0$. Also assume that the density $p$ is bounded. Then, $\hat{H}$ is asymptotically optimal in the sense that

$$\frac{\|p - \hat{p}_{\hat{H}}\|}{\|p - \hat{p}_{H_{opt}}\|} \xrightarrow[N \to \infty]{} 1 \quad \text{a.s.} \tag{22}$$

Note that the Gaussian kernel satisfies the assumptions of Theorem 6.

Let be i.i.d. processes on having a density with respect to the law of a Poisson process of intensity . Assume that there exists a deterministic constant with . Then, Theorem 6 can be applied to . Therefore, the cross validation procedure (21) to select the bandwidth matrix in the persistence surface ensures that the mean persistence surface

$$\overline{\rho}_N := \frac{1}{N} \sum_{i=1}^{N} \rho(D_s[K(X_i)]) \tag{23}$$

is a good estimator of the density of .

## 7 Numerical illustration

Three sets of synthetic data are considered (see Figure 1). The first one (a) is made of sets of i.i.d. points uniformly sampled in the square . The second one (b) is made of samples of a clustered process: cluster centers are uniformly sampled in the square. Each center is then replaced with i.i.d. points following a normal distribution of standard deviation . The third dataset (c) is made of samples of uniform points on a torus of inner radius and outer radius . For each set, a Čech persistence diagram for -dimensional homology is computed. Persistence diagrams are then transformed under the map , so that they now live in the upper-left quadrant of the plane. Figure 2 shows the superposition of the diagrams in each class. One may observe the slight differences in the structure of the topological noise over the classes (a) and (b). The cluster of most persistent points in the diagrams of class (c) corresponds to the two holes of a torus and is distinguishable from the rest of the points in the diagrams of the class, which form topological noise. The persistence diagrams are weighted by the weight function , as advised in [19] for two-dimensional point clouds. The bandwidth selection procedure will be applied to the measures having density with respect to the diagrams, i.e. a measure is a sum of weighted Dirac measures.

For each class of dataset, the score is computed for a set of bandwidth matrices of the form , for values evenly spaced on a log-scale between and . Note that the computation of only involves the computations of for points , in different diagrams. Hence, the complexity of the computation of is in , where is the sum of the number of points in the diagrams of a given class. If this is too costly, one may use a subsampling approach to estimate the integrals. The selected bandwidths were respectively . Persistence surfaces for the selected bandwidths are displayed in Figure 3. The persistence of the "true" points of the torus is sufficient to suppress the topological noise: only two yellow areas are seen in the persistence surface of the torus. Note that the two areas can be separated, whereas this is not obvious when looking at the superposition of the diagrams, and would not have been obvious with an arbitrary choice of bandwidth. The bandwidth for class (b) may appear to have been chosen too large. However, there is much more variability in class (b) than in the other classes: this phenomenon explains why the density is less peaked around a few selected areas than in class (a).

Illustrations on non-synthetic data are shown in the appendix: similar behaviors are observed.

## 8 Conclusion and further works

Taking a measure point of view to represent persistence diagrams, we have shown that the expected behavior of persistence diagrams built on top of random point sets turns out to have a simple and interesting structure: a measure on with density with respect to Lebesgue measure that is as smooth as the random process generating the data points! This opens the door to the use of effective kernel density estimation techniques for the estimation of the expectation of topological features of data. Our approach and results also seem to be particularly well-suited to the use of recent results on the Lepski method for parameter selection [22] in statistics, a research direction that deserves further exploration. As many persistence-based features considered in the literature (persistence images, birth and death distributions, Betti curves, …) can be expressed as linear functionals of the discrete measure representation of diagrams, our results immediately extend to them. The ability to select the parameters on which these features depend in a well-founded statistical way also opens the door to a well-justified usage of persistence-based features in further supervised and unsupervised learning tasks.
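To make the linearity remark concrete, here is a small sketch for one such feature, the Betti curve. It assumes diagrams are given in (birth, death) coordinates, and `betti_curve` is a hypothetical helper name. The value at `t` counts the points with `birth <= t < death`, that is, the integral of an indicator function against the diagram measure, so the curve of a sum of measures is the sum of the curves.

```python
import numpy as np

def betti_curve(diagram, ts):
    """Count, for each t in ts, the points (b, d) of the diagram with b <= t < d."""
    b = diagram[:, 0][None, :]
    d = diagram[:, 1][None, :]
    t = np.asarray(ts, dtype=float)[:, None]
    return ((b <= t) & (t < d)).sum(axis=1)

# Two toy diagrams in (birth, death) coordinates.
D1 = np.array([[0.0, 1.0], [0.5, 2.0]])
D2 = np.array([[0.2, 0.9]])
ts = np.array([0.25, 0.75, 1.5])

# Linearity: the curve of the summed measure equals the sum of the curves.
lhs = betti_curve(np.vstack([D1, D2]), ts)
rhs = betti_curve(D1, ts) + betti_curve(D2, ts)
```

Averaging such curves over samples therefore estimates the same functional applied to the expected diagram.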

## References

• [1] Henry Adams, Tegan Emerson, Michael Kirby, Rachel Neville, Chris Peterson, Patrick Shipman, Sofya Chepushtanova, Eric Hanson, Francis Motta, and Lori Ziegelmeier. Persistence images: a stable vector representation of persistent homology. Journal of Machine Learning Research, 18(8):1–35, 2017.
• [2] Christophe Biscio and Jesper Møller. The accumulated persistence function, a new useful functional summary statistic for topological data analysis, with a view to brain artery trees and spatial point process applications. arXiv preprint arXiv:1611.00630, 2016.
• [3] Omer Bobrowski, Matthew Kahle, Primoz Skraba, et al. Maximally persistent cycles in random geometric complexes. The Annals of Applied Probability, 27(4):2032–2060, 2017.
• [4] Peter Bubenik. Statistical topological data analysis using persistence landscapes. The Journal of Machine Learning Research, 16(1):77–102, 2015.
• [5] Mickaël Buchet, Frédéric Chazal, Steve Y Oudot, and Donald R Sheehy. Efficient and robust persistent homology for measures. Computational Geometry, 58:70–96, 2016.
• [6] F. Chazal, D. Cohen-Steiner, L. J. Guibas, F. Memoli, and S. Y. Oudot. Gromov-Hausdorff stable signatures for shapes using persistence. Computer Graphics Forum (proc. SGP 2009), pages 1393–1403, 2009.
• [7] F. Chazal, V. de Silva, and S. Oudot. Persistence stability for geometric complexes. Geometriae Dedicata, 173(1):193–214, 2014.
• [8] Frédéric Chazal, Vin de Silva, Marc Glisse, and Steve Oudot. The structure and stability of persistence modules. SpringerBriefs in Mathematics. Springer, 2016.
• [9] Frédéric Chazal, Brittany Terese Fasy, Fabrizio Lecci, Alessandro Rinaldo, and Larry Wasserman. Stochastic convergence of persistence landscapes and silhouettes. In Proceedings of the thirtieth annual symposium on Computational geometry, page 474. ACM, 2014.
• [10] Frédéric Chazal, Marc Glisse, Catherine Labruère, and Bertrand Michel. Convergence rates for persistence diagram estimation in topological data analysis. Journal of Machine Learning Research, 16:3603–3635, 2015.
• [11] Yen-Chi Chen, Daren Wang, Alessandro Rinaldo, and Larry Wasserman. Statistical analysis of persistence intensity functions. arXiv preprint arXiv:1510.02502, 2015.
• [12] David Cohen-Steiner, Herbert Edelsbrunner, and John Harer. Stability of persistence diagrams. Discrete & Computational Geometry, 37(1):103–120, 2007.
• [13] HSM Coxeter. The circumradius of the general simplex. The Mathematical Gazette, pages 229–231, 1930.
• [14] Trinh Khanh Duy, Yasuaki Hiraoka, and Tomoyuki Shirai. Limit theorems for persistence diagrams. arXiv preprint arXiv:1612.08371, 2016.
• [15] B. T. Fasy, F. Lecci, A. Rinaldo, L. Wasserman, S. Balakrishnan, A. Singh, et al. Confidence sets for persistence diagrams. The Annals of Statistics, 42(6):2301–2339, 2014.
• [16] H. Edelsbrunner and D. Morozov. Persistent homology. In Handbook of Discrete and Computational Geometry (3rd Ed - To appear). CRC Press (to appear), 2017.
• [17] Matthew Kahle, Elizabeth Meckes, et al. Limit theorems for Betti numbers of random simplicial complexes. Homology, Homotopy and Applications, 15(1):343–374, 2013.
• [18] Ludger Kaup and Burchard Kaup. Holomorphic functions of several variables: an introduction to the fundamental theory, volume 3. Walter de Gruyter, 1983.
• [19] Genki Kusano, Kenji Fukumizu, and Yasuaki Hiraoka. Kernel method for persistence diagrams via kernel embedding and weight factor. arXiv preprint arXiv:1706.03472, 2017.
• [20] Genki Kusano, Yasuaki Hiraoka, and Kenji Fukumizu. Persistence weighted Gaussian kernel for topological data analysis. In International Conference on Machine Learning, pages 2004–2013, 2016.
• [21] J.H. Kwak and S. Hong. Linear Algebra. Birkhäuser Boston, 2004.
• [22] Claire Lacour, Pascal Massart, and Vincent Rivoirard. Estimator selection: a new method with applications to kernel density estimation. arXiv preprint arXiv:1607.05091, 2016.
• [23] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013.
• [24] F. Morgan. Geometric Measure Theory: A Beginner’s Guide. Elsevier Science, 2016.
• [25] Jan Reininghaus, Stefan Huber, Ulrich Bauer, and Roland Kwitt. A stable multi-scale kernel for topological machine learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4741–4748, 2015.
• [26] M. Shiota. Geometry of Subanalytic and Semialgebraic Sets. Progress in mathematics. Springer, 1997.
• [27] Charles J Stone. An asymptotically optimal window selection rule for kernel density estimates. The Annals of Statistics, pages 1285–1297, 1984.
• [28] Katharine Turner, Yuriy Mileyko, Sayan Mukherjee, and John Harer. Fréchet means for distributions of persistence diagrams. Discrete & Computational Geometry, 52(1):44–70, 2014.
• [29] Yuhei Umeda. Time series classification via topological data analysis. Transactions of the Japanese Society for Artificial Intelligence, 32(3):D–G72_1, 2017.
• [30] D Yogeshwaran, Robert J Adler, et al. On the topology of random complexes built over stationary point processes. The Annals of Applied Probability, 25(6):3338–3380, 2015.
• [31] D. Yogeshwaran, Eliran Subag, and Robert J. Adler. Random geometric complexes in the thermodynamic regime. Probability Theory and Related Fields, 167(1):107–142, Feb 2017.

## Appendix A Proofs of the subanalytic elementary lemmas

See 2.2

###### Proof.
1. Section I.2.1 in [26] states that is subanalytic. Therefore, its complement is also subanalytic: it is enough to show that is of empty interior to conclude.

Claim: The set of points where is not analytic but is locally a real analytic manifold in is a subanalytic set of empty interior.

Proof: Assume contains an open set . Replacing by a smaller open set if necessary, there exists some local parametrization of by some analytic function , being a neighborhood of in . Denote by the gradient of with respect to the real variable . The set on which is an analytic subset of . As is the graph of a function, is made of isolated points: one can always assume that those points are not in . Therefore, there exists some neighborhood of which does not intersect . One can now apply the analytic implicit function theorem (see for instance [18, Section 8]) anywhere on : for , there exists some neighborhood and an analytic function , being a neighborhood of , such that, on

 Φ(x,u)=0⟺u=g(x).

As we also have if and only if , on and is analytic on . This contradicts the assumption that is not analytic at any point of .

Now, the set is the union of and of where is the projection on of . As, by definition, is of empty interior, is also of empty interior. Therefore, is of empty interior, which is equivalent to saying that its dimension is smaller than .

2. See [26, Section II.1.1].

3. See [26, Section II.1.6].

See 2.2

###### Proof.

Write the dimension of . First, one can always assume that is closed, as . Therefore, there exists some real analytic manifold of dimension and a proper real analytic mapping such that (see [26, Section I.2.1]). The set can be written as the union of some compact sets for . It is enough to show that . The set can be written , where is some compact subset of . We have because is of dimension . Furthermore, as is analytic on , it is Lipschitz on . Therefore, is also null. ∎

## Appendix B Proof of Theorem 3

See 3

We indicate how to change the proof of Theorem 3 when assumption (K5’) is satisfied instead of assumption (K5). In the partition of , the set plays a special role: it corresponds to the value and contains all the singletons, which satisfy by assumption. Lemma 4 holds for and one can always define to be a minimal element of . With this convention in mind, it is straightforward to check that Lemma 4 still holds and that Lemma 4 is satisfied as well for . Now, one can define the sets in a similar manner. For , the diagram is still decomposed , with . If , the end of the proof is similar. However, for , the pairs of simplices are made of one singleton