Instability of the Betti Sequence for Persistent Homology and a Stabilized Version of the Betti Sequence

09/19/2021 ∙ by Megan Johnson, et al. ∙ POSTECH ∙ University at Buffalo

Topological Data Analysis (TDA), a relatively new field of data analysis, has proved very useful in a variety of applications. The main tool from TDA is persistent homology, in which data structure is examined at many scales. Representations of persistent homology include persistence barcodes and persistence diagrams, neither of which is straightforward to reconcile with traditional machine learning algorithms, as they are sets of intervals and multisets, respectively. The problem of faithfully representing barcodes and persistence diagrams has been pursued along two main avenues: kernel methods and vectorizations. One vectorization is the Betti sequence, or Betti curve, derived from the persistence barcode. While the Betti sequence has been used in classification problems in various applications, to our knowledge, the stability of the sequence has never before been discussed. In this paper we show that the Betti sequence is unstable under the 1-Wasserstein metric with respect to small perturbations in the barcode from which it is calculated. In addition, we propose a novel stabilized version of the Betti sequence based on the Gaussian smoothing seen in the Stable Persistence Bag of Words for persistent homology. We then introduce the normalized cumulative Betti sequence and provide numerical examples that support the main statement of the paper.

1. Introduction

Topological Data Analysis (TDA) is a rising field useful for the analysis of high-dimensional data structure [4]. A popular tool in TDA is persistent homology, introduced by Edelsbrunner et al. in 2002 [8], where snapshots of the topological structure of the data set are taken at many different scales and the results are compared from one scale to the next. Singular homology is, in general, hard to compute, and almost impossible to compute for real-world data, but when the given topological space is approximated from finitely many points, simplicial homology can be used instead, as it is easily computable. Persistent homology is obtained by computing simplicial homology at different scales. Previous applications of TDA and persistent homology include 3D shape segmentation [5], astrophysics [7, 9, 16], biology and medicine [11, 12], and neuroscience [3, 13], to name a few.

Popular representations of persistent homology include persistence diagrams and persistence barcodes. The two are mathematically equivalent, and they show how the homological structures of the given data change with scale. Although these are useful for data analysis, they are not necessarily compatible with typical machine learning workflows in their raw forms, as they are designed as multisets and collections of intervals, respectively. One way to reconcile these persistent homology representations with machine learning algorithms is to vectorize the persistence diagrams or the persistence barcodes [1]. There are several vectorization methods, including the topological vector, the persistence vector, and persistence images. These vectorization methods are, in general, more computationally efficient than kernel methods. The construction of topological vectors and persistence vectors is straightforward and easy to implement. The persistence image is less straightforward (and slower) to compute but, in general, delivers better classification results. Another vectorization, the one we consider in this paper, is the Betti sequence [14], which contains the Betti numbers of the homology groups of the simplicial complex built on the data set at all scales of the persistent homology.

In this paper, we prove by example that the Betti sequence is unstable with respect to the 1-Wasserstein distance. In other words, a small change in a persistence diagram, measured in the 1-Wasserstein distance, can lead to a disproportionately large change in the Betti sequence. To our knowledge, the instability of the Betti sequence, although mentioned in [6], has not yet been explicitly shown. In practice, such a large change may not be significant if finer filtration intervals are chosen, but to remedy the instability in the Betti sequence we propose a new stabilized version and prove its stability. In the stabilization, we adopt a Gaussian-smoothing approach similar to that in [10, 17]. We also present numerical examples that support our statement and validate the proposed stabilization of the Betti sequence.

This paper is organized as follows: Section 2 covers the background of persistent homology and its representations, together with the definition and instability of the Betti sequence. Section 3 introduces our proposed stabilization of the Betti sequence and provides a proof of its stability. Section 4 demonstrates the normalized cumulative Betti sequence on various data sets. Section 5 contains our concluding remarks.

2. Persistent Homology and Betti Sequence

The main persistent homology representations that we consider in this paper are the persistence diagram and the persistence barcode. The two are equivalent, and each can be recovered from the other, although they lend themselves to different numerical manipulations. Let $X$ be a given topological space. Singular homology describes the homological structure of $X$ with the $k$-dimensional homology group $H_k(X)$. The homology group $H_k(X)$ is defined by $H_k(X) = \ker \partial_k / \operatorname{im} \partial_{k+1}$, which is the quotient of the kernel, $\ker \partial_k$, by the image, $\operatorname{im} \partial_{k+1}$, of the boundary maps $\partial_k : C_k(X) \to C_{k-1}(X)$, where $C_k(X)$ is the free abelian group whose basis is the set of singular $k$-simplices in $X$. Roughly speaking, the rank of the $k$-dimensional homology group $H_k(X)$, called the $k$th Betti number, indicates how many $k$-dimensional holes there are in $X$. This information is useful in understanding the topological structure of $X$. However, computing $H_k(X)$ is not easy, as $X$ is an arbitrary topological space in general. In fact, it is not practical to use $H_k(X)$ for $X$ from real-world applications. Thus, in order to use homological features of $X$ for data analysis, instead of using $X$ directly, we use a point cloud sampled from $X$.

We use typical building algorithms to obtain the point cloud approximation of $X$, known as the simplicial complex $K$. There are various ways of constructing $K$, including the Vietoris-Rips complex $K_\tau$, where the non-negative real number $\tau$ is known as the filtration parameter. With the given value of $\tau$, $K_\tau$ is constructed by gluing simplices whose vertices are pairwise within distance $\tau$. However, it is not known which value of $\tau$ approximates $X$ best. For this reason, we construct $K_\tau$ for various $\tau$, which gives us the notion of persistence. The $k$th homology, $H_k(K_\tau)$, corresponding to $K_\tau$ can be defined similarly as above. Let $\beta_k$ be the Betti number for $H_k(K_\tau)$. Here $\beta_0$ represents the number of connected components in $K_\tau$ and $\beta_k$ the number of $k$-dimensional cycles or holes. As we have the natural inclusion $K_{\tau_i} \subseteq K_{\tau_j}$ for $\tau_i \le \tau_j$, and a homomorphism $H_k(K_{\tau_i}) \to H_k(K_{\tau_j})$, we have the relation between $\beta_k$ versus $\tau$, which generates the graph of the persistence barcodes in the considered dimension $k$. On the persistence barcode, an interval of filtration values over which the same homological feature persists is known as a bar and indicates the $k$-dimensional hole structure of $X$. If we call the starting point of each bar the birth, $b$, and the ending point the death, $d$, we can create the persistence diagram multiset by considering all points of the form $(b, d)$. Vectorizations of persistence diagrams and persistence barcodes and their stability are key aspects that we consider in this paper.
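To make the role of the filtration parameter concrete, the following Python sketch counts the connected components (the zeroth Betti number) of a Vietoris-Rips complex at a single scale using a union-find structure. It is an illustration only: the function name `betti0_rips` and the toy point cloud are hypothetical, and the code assumes the convention that two points are joined when their distance is at most the filtration value (some references use twice that value).

```python
import math

def betti0_rips(points, tau):
    """Number of connected components of the Rips complex at scale tau."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the component representative (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            # Assumed convention: an edge joins two points at distance <= tau.
            if math.dist(points[i], points[j]) <= tau:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})

# Toy point cloud (hypothetical): three collinear points.
cloud = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
counts = [betti0_rips(cloud, tau) for tau in (0.5, 1.0, 4.0)]
```

As the scale grows the components merge, so `counts` is non-increasing; tracking when each merge happens is what produces the bars of the zero-dimensional barcode.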

2.1. Definition of the Betti Sequence

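As explained above, we now consider the simplicial complex $K_\tau$, built at filtration value $\tau$, instead of the topological space $X$ by sampling finitely many distinct points from $X$. Recall that the Betti number, $\beta_k$, is the rank of the $k$th dimensional homology group $H_k(K_\tau)$. More specifically, $\beta_0$ represents the number of connected components in $K_\tau$, $\beta_1$ the number of $1$-dimensional holes, etc. As we have the natural inclusion of our filtered simplicial complex, namely $K_{\tau_i} \subseteq K_{\tau_j}$ for $\tau_i \le \tau_j$, we have a homomorphism $H_k(K_{\tau_i}) \to H_k(K_{\tau_j})$. This defines a relationship between $\beta_k$ and $\tau$, which is used to generate the persistence barcode of dimension $k$.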

The Betti sequence, or Betti curve, originally defined in 2017, is the vectorization of Betti numbers obtained in persistent homology [14]. The Betti sequence uses the persistence barcode and a discretization of the filtration interval to define the vectorization. At each value of the filtration parameter $\tau$ in the discretization, we count the number of generators existing at that filtration value, and that count is our vector entry. We provide the formal definition of the Betti sequence below.

Definition 2.1.

Given a persistence barcode of dimension $k$ with finitely many persistence intervals and a maximum filtration $\tau_{\max}$, let $\tau_1, \dots, \tau_n$ be equally spaced points in $[0, \tau_{\max}]$. Let $\beta = (\beta^1, \dots, \beta^n)$ be the vector whose entries $\beta^i$ count the number of persistence intervals in the barcode existing for the filtration value $\tau_i$.
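A minimal Python sketch of this point-sampled definition is given here. The function name `betti_sequence_sampled` is hypothetical, and two conventions are assumptions rather than part of the definition: the sample values include both endpoints of the filtration interval, and a bar `(b, d)` is treated as the closed interval from `b` to `d`.

```python
def betti_sequence_sampled(bars, tau_max, n):
    """Count the bars alive at each of n equally spaced filtration values."""
    # Assumption: the n sample values run from 0 to tau_max inclusive (n >= 2).
    taus = [tau_max * i / (n - 1) for i in range(n)]
    # Assumption: a bar (b, d) is alive at tau when b <= tau <= d.
    return [sum(1 for b, d in bars if b <= t <= d) for t in taus]

# Toy barcode (hypothetical): one long bar and one short bar.
bars = [(0.0, 0.6), (0.3, 0.4)]
seq = betti_sequence_sampled(bars, tau_max=1.0, n=3)  # samples at 0, 0.5, 1
```

In this toy barcode the short bar `(0.3, 0.4)` lies strictly between the sample values 0 and 0.5, so it contributes to no entry of `seq`.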

This definition suffers from the following flaw: persistence bars, if they fall entirely in between discretization values, are not counted at all and have no impact on the Betti sequence. Note that if the mesh size is exceedingly small, then any bar which falls entirely in between the $\tau$ values is likely due to noise and its omission might not be an issue. However, if the mesh size is chosen poorly, this could result in a major loss of information.

We propose an alternate definition of the Betti sequence which agrees with the original definition in the limit as the mesh size goes to zero.

Definition 2.2.

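Given a persistence barcode of dimension $k$ with finitely many bars and a maximum filtration $\tau_{\max}$, divide the interval $[0, \tau_{\max}]$ into $n$ equal subintervals $[\tau_{i-1}, \tau_i]$, $i = 1, \dots, n$, of length $\Delta\tau = \tau_{\max}/n$. Let $\beta = (\beta^1, \dots, \beta^n)$ be the vector whose entries $\beta^i$ count the number of bars in the barcode that exist for at least one filtration value in the $i$th subinterval of the filtration interval $[0, \tau_{\max}]$.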
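The subinterval-based definition can be sketched in Python as follows. The function name `betti_sequence_binned` is hypothetical, and the code assumes closed bars and closed subintervals, so a bar `(b, d)` meets a subinterval `[lo, hi]` exactly when `b <= hi` and `d >= lo`.

```python
def betti_sequence_binned(bars, tau_max, n):
    """Count the bars that meet each of n equal subintervals of [0, tau_max]."""
    step = tau_max / n
    seq = []
    for i in range(1, n + 1):
        lo, hi = (i - 1) * step, i * step
        # Assumed closed-interval convention for the overlap test.
        seq.append(sum(1 for b, d in bars if b <= hi and d >= lo))
    return seq

# Toy barcode (hypothetical): one long bar and one short bar.
bars = [(0.0, 0.6), (0.3, 0.4)]
seq = betti_sequence_binned(bars, tau_max=1.0, n=2)  # bins [0, 0.5] and [0.5, 1]
```

Unlike point sampling, every bar inside the filtration interval is counted in at least one entry; here the short bar `(0.3, 0.4)` is counted in the first bin.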

This alternate definition has the added benefit of being easily translatable to the language of persistence diagrams, making the study of the stability of the Betti sequence with respect to the $p$-Wasserstein metric possible.

2.2. Redefining the Betti Sequence via the Persistence Diagram

In order to discuss stability with respect to the Wasserstein metric, we need to redefine the Betti sequence in terms of the persistence diagram. Consider a persistence diagram with finitely many off-diagonal points. Then a bar on the corresponding persistence barcode exists at some filtration value in the $i$th subinterval $[\tau_{i-1}, \tau_i]$ if and only if its birth-death point on the persistence diagram falls in the shaded region, call it $R_i$, illustrated in Figures 1 and 2.

Figure 1. The shaded region, $R_i$, of the persistence diagram (axes: birth and death) corresponding to the filtration interval $[\tau_{i-1}, \tau_i]$ on a barcode. Here the line death = birth is the diagonal.

Figure 2. The shaded region, $R_i$, of the persistence diagram (axes: birth and death) corresponding to the filtration interval $[\tau_{i-1}, \tau_i]$ on a barcode for four different cases. Here the line death = birth is the diagonal.

More precisely, a bar on a barcode exists at some filtration value in the subinterval $[\tau_{i-1}, \tau_i]$ if and only if its birth-death point on the persistence diagram falls in one of the following four subregions of $R_i$:

  • If the birth-death point lies in the red region, then the bar begins after $\tau_{i-1}$ and ends after $\tau_i$.

  • If the birth-death point lies in the orange region, then the bar begins before $\tau_{i-1}$ and ends after $\tau_i$.

  • If the birth-death point lies in the green region, then the bar begins after $\tau_{i-1}$ and ends before $\tau_i$.

  • If the birth-death point lies in the blue region, then the bar begins before $\tau_{i-1}$ and ends before $\tau_i$.
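The case analysis above can be written as a small classifier. This is a sketch under stated assumptions: the function name `region_case` and the arguments `tau_lo` and `tau_hi` (the lower and upper endpoints of the subinterval) are hypothetical, each bullet is read as comparing the birth with the lower endpoint and the death with the upper endpoint, and the treatment of ties on the boundary lines is a convention chosen here, not taken from the text.

```python
def region_case(birth, death, tau_lo, tau_hi):
    """Return the colour of the subregion containing (birth, death), or None.

    None means the bar does not exist at any filtration value in
    [tau_lo, tau_hi], i.e. the point lies outside the shaded region.
    """
    if birth > tau_hi or death < tau_lo:
        return None
    begins_after = birth >= tau_lo   # tie convention: assumption
    ends_after = death > tau_hi      # tie convention: assumption
    if begins_after and ends_after:
        return "red"
    if not begins_after and ends_after:
        return "orange"
    if begins_after and not ends_after:
        return "green"
    return "blue"

# Hypothetical subinterval [0.5, 1.0] and one sample point per case.
labels = [region_case(b, d, 0.5, 1.0)
          for b, d in [(0.6, 1.2), (0.2, 1.2), (0.6, 0.9), (0.2, 0.7)]]
```

Only points labelled "green" lie in a single shaded region; the other three cases also belong to the region of a neighbouring subinterval, which is where the overlap between consecutive regions comes from.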

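Let us now make a precise description of the region $R_i$ as a subset of $\mathbb{R}^2$ (with multiplicity). Writing a birth-death point as $(b, d)$ with $b < d$, and $[\tau_{i-1}, \tau_i]$ for the $i$th subinterval, the four cases above combine into

$$R_i = \{ (b, d) \in \mathbb{R}^2 : b < d,\ b \le \tau_i,\ d \ge \tau_{i-1} \},$$

where the inclusion of the boundary lines follows the closed-interval convention for bars and subintervals.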

Definition 2.3.

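Let $D$ be a persistence diagram with finitely many off-diagonal points. The Betti sequence of $D$ is defined as

$$\beta(D) = \left( \beta^1(D), \dots, \beta^n(D) \right), \qquad \beta^i(D) = |R_i \cap D|,$$

where $|R_i \cap D|$ is the cardinality of the intersection of $R_i$, described above, and the persistence diagram $D$.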

2.3. Instability of the Betti Sequence

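Recall the definition of the $p$-Wasserstein distance between persistence diagrams [15].

Definition 2.4.

The $p$-Wasserstein distance between persistence diagrams $D$ and $D'$ is

$$W_p(D, D') = \inf_{\gamma} \left( \sum_{(x, y) \in \gamma} \| x - y \|_\infty^p \right)^{1/p}$$

where $\gamma$ is a partial matching of $D$ and $D'$, in which unmatched points are paired with their nearest points on the diagonal. Note that as $p \to \infty$ this distance becomes the bottleneck distance.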
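For very small diagrams the 1-Wasserstein distance can be computed by brute force, which is sometimes handy for checking examples by hand. The sketch below is not an efficient algorithm and rests on stated assumptions: the function name `wasserstein_1` is hypothetical, the ground metric between points is taken to be the sup-norm, and an unmatched point is charged its sup-norm distance to the diagonal, which is half its persistence.

```python
from itertools import permutations

def wasserstein_1(D1, D2):
    """Brute-force 1-Wasserstein distance between two tiny persistence diagrams."""
    def cost(p, q):
        # None stands for "matched to the diagonal".
        if p is None and q is None:
            return 0.0
        if p is None:
            return (q[1] - q[0]) / 2.0
        if q is None:
            return (p[1] - p[0]) / 2.0
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))  # sup-norm (assumed)

    # Pad each diagram with diagonal slots so every point may go unmatched.
    A = list(D1) + [None] * len(D2)
    B = list(D2) + [None] * len(D1)
    return min(sum(cost(a, b) for a, b in zip(A, perm))
               for perm in permutations(B))

# Hypothetical diagrams that differ by moving one death value by 0.05.
D1 = [(0.1, 0.4), (0.6, 0.9)]
D2 = [(0.1, 0.4), (0.6, 0.95)]
w = wasserstein_1(D1, D2)
```

The search runs over all permutations of the padded diagram, so it is only practical for a handful of points; dedicated TDA libraries solve the same assignment problem far more efficiently.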

Theorem 2.5.

Let $D$ and $D'$ be persistence diagrams containing only finitely many off-diagonal points. The Betti sequence is not stable with respect to the 1-Wasserstein distance. That is, there does not exist a non-negative constant $C$ such that

$$\| \beta(D) - \beta(D') \|_\infty \le C \, W_1(D, D')$$

holds for all such persistence diagrams $D$ and $D'$.

Proof.

We will prove by example. Let $D$ be a persistence diagram with finitely many off-diagonal points and with maximum filtration $\tau_{\max}$. Suppose further, for diagram $D$, that the number of subintervals, $n$, of $[0, \tau_{\max}]$ is fixed so that there exists exactly one birth-death point $(b_i, d_i)$ with $\tau_{i-1} < b_i < d_i < \tau_i$ in each of the non-overlapping parts of the regions $R_i$, described above, for $i = 1, \dots, n$ as seen in Figure 3. Fix an index $j < n$ and for any sufficiently small $\epsilon > 0$ let $d_j = \tau_j - \epsilon/2$. The Betti sequence vector, $\beta(D)$, of this persistence diagram is then, by definition, given by the following

$$\beta(D) = (1, 1, \dots, 1).$$

Figure 3. Left: The persistence diagram where all persistence points exist in the non-overlapping regions of the $R_i$ (the shaded triangles). Right: The corresponding persistence barcode.

Now consider another persistence diagram $D'$ with finitely many off-diagonal points with the same maximum filtration, $\tau_{\max}$, and the same number of intervals, $n$. Suppose $D'$ is almost an exact copy of $D$ except that $(b_j, d_j)$ has been shifted by $\epsilon$ to become $(b_j, d_j + \epsilon)$. Then the persistence point $(b_j, d_j + \epsilon)$ is in $R_j \cap R_{j+1}$ and the Betti sequence for $D'$ is

$$\beta(D') = (\beta^1, \dots, \beta^n)$$

where $\beta^{j+1} = 2$, and $\beta^i = 1$ for $i \ne j+1$. For the Betti sequence vectorization to be stable with respect to the 1-Wasserstein distance under the small perturbation of $\epsilon$ we need a non-negative constant $C$ such that

$$\| \beta(D) - \beta(D') \|_\infty \le C \, W_1(D, D').$$

Clearly, $\| \beta(D) - \beta(D') \|_\infty = 1$ and if we recall the definition of the 1-Wasserstein distance

$$W_1(D, D') = \inf_{\gamma} \sum_{(x, y) \in \gamma} \| x - y \|_\infty$$

where $\gamma$ is a partial matching of $D$ and $D'$, we know that $W_1(D, D') \le \epsilon$. Thus for stability we need a non-negative constant $C$ such that $1 \le C \epsilon$. However, as $\epsilon$ can be made arbitrarily small, there does not exist such a constant $C$. Therefore the Betti sequence is unstable with respect to the 1-Wasserstein distance. ∎
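The argument can be checked numerically on a two-bin toy example. The sketch below is illustrative only: the function names, the specific points, and the choice of placing one death value half the shift below the bin boundary are assumptions made here. It counts points per region with a closed-interval convention and reports the change in the Betti sequence divided by the size of the shift, which bounds the 1-Wasserstein distance from above.

```python
def betti_from_diagram(diagram, tau_max, n):
    """Entry i counts the points (b, d) whose bar meets the i-th subinterval."""
    step = tau_max / n
    return [sum(1 for b, d in diagram if b <= i * step and d >= (i - 1) * step)
            for i in range(1, n + 1)]

def perturbation_ratio(eps, tau_max=1.0, n=2):
    """Sup-norm change of the Betti sequence divided by the shift eps."""
    step = tau_max / n
    # One point per bin; the first dies eps/2 below the bin boundary (assumed setup).
    D = [(0.1, step - eps / 2), (0.6, 0.9)]
    # Shift that death value up by eps so the bar now crosses the boundary.
    D_shift = [(0.1, step + eps / 2), (0.6, 0.9)]
    before = betti_from_diagram(D, tau_max, n)
    after = betti_from_diagram(D_shift, tau_max, n)
    diff = max(abs(x - y) for x, y in zip(before, after))
    return diff / eps  # one point moved by eps, so W_1 is at most eps

ratios = [perturbation_ratio(eps) for eps in (1e-1, 1e-2, 1e-3)]
```

The entries of `ratios` grow like the reciprocal of the shift, so no single constant can bound the ratio for every shift size, which is the content of the proof.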

Remark 2.6.

Similarly, it can be shown that the Betti sequence is unstable with respect to the Wasserstein distance with $p = \infty$, i.e. with respect to the bottleneck distance.

3. Stabilized Betti Sequence

We now propose a stabilized version of the Betti sequence inspired by the Gaussian smoothing techniques seen in [10, 17] and prove its stability with respect to the 1-Wasserstein distance.

Definition 3.1.

Suppose $D$ is a persistence diagram with finitely many off-diagonal points and maximum filtration $\tau_{\max}$. Divide the interval $[0, \tau_{\max}]$ into $n$ equal subintervals of the form $[\tau_{i-1}, \tau_i]$, each of length $\Delta\tau$, as above. Let

be a collection of Gaussian distributions where

, the mean of , is chosen to be and where is the covariance matrix for . Define to be and note that each and . Then the stable Betti Sequence vector is defined by

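where

$$f_i(x) = \frac{1}{2\pi \sqrt{|\Sigma_i|}} \exp\left( -\frac{1}{2} (x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i) \right)$$

is the probability density function of the $i$th Gaussian, with mean $\mu_i$ and covariance matrix $\Sigma_i$, at a point $x$ of the persistence diagram. Note that $|\Sigma_i|$ is the determinant of the covariance matrix, $\Sigma_i$.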
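For reference, the density of a bivariate Gaussian can be evaluated without any numerical library by inverting the 2x2 covariance matrix explicitly. The function name `gaussian_pdf_2d` and the sample mean and covariance values below are hypothetical; in particular they are not the choices made in the definition above.

```python
import math

def gaussian_pdf_2d(x, mu, sigma):
    """Density at x of a bivariate Gaussian with mean mu and 2x2 covariance sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c  # determinant of the covariance matrix
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T sigma^{-1} (x - mu) via the explicit 2x2 inverse.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

# Hypothetical values: isotropic covariance with standard deviation 0.5.
peak = gaussian_pdf_2d((0.0, 0.0), (0.0, 0.0), ((0.25, 0.0), (0.0, 0.25)))
off_peak = gaussian_pdf_2d((0.5, 0.0), (0.0, 0.0), ((0.25, 0.0), (0.0, 0.25)))
```

A smaller covariance gives a taller, narrower density, which is the sense in which a "sharp" Gaussian concentrates its weight near the mean.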

Remark 3.2.

The choice of the covariance matrix $\Sigma_i$ is still an open question. Ideally, we would want to use a sharp Gaussian, and so $\Sigma_i$ should be defined so that the density $f_i(x)$ is essentially the same for every persistence point $x$ in $R_i$ not “near” the boundary of $R_i$. The choice of $\Sigma_i$ should be further studied in future work.

Theorem 3.3.

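Let $D$ be a persistence diagram of finite size and $D'$ be the persistence diagram obtained by perturbing $D$ by an arbitrary $\epsilon > 0$, so that $W_1(D, D') \le \epsilon$. Writing $\beta_s(\cdot)$ for the stable Betti sequence of Definition 3.1, there exists a non-negative constant $C$ such that, for any such $\epsilon$,

$$\| \beta_s(D) - \beta_s(D') \|_\infty \le C \, W_1(D, D').$$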

Proof.

Let be the number of off-diagonal points in . Let be the partial matching that realizes the 1-Wasserstein distance between and . For a fixed we have

As the Gaussian density $f_i$ is continuously differentiable with bounded gradient, it is also Lipschitz continuous with some Lipschitz constant $L_i$. We get

If we let we have the desired result. ∎
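The Lipschitz step of the proof can be illustrated numerically. The sketch below assumes an isotropic Gaussian with standard deviation `s`, which is a simplification chosen here and not the covariance used in the paper; the function names and parameter values are likewise hypothetical. For such a density the gradient norm is largest at distance `s` from the mean, giving the explicit Lipschitz constant $e^{-1/2} / (2\pi s^3)$ with respect to the Euclidean norm.

```python
import math
import random

def isotropic_gaussian_pdf(x, mu, s):
    """Density of a bivariate Gaussian with covariance s^2 times the identity."""
    r2 = (x[0] - mu[0]) ** 2 + (x[1] - mu[1]) ** 2
    return math.exp(-r2 / (2.0 * s * s)) / (2.0 * math.pi * s * s)

def lipschitz_constant(s):
    """Largest gradient norm of the density, attained at distance s from the mean."""
    return math.exp(-0.5) / (2.0 * math.pi * s ** 3)

rng = random.Random(0)          # fixed seed for reproducibility
mu, s = (0.25, 0.75), 0.2       # hypothetical mean and standard deviation
L = lipschitz_constant(s)
worst_ratio = 0.0
for _ in range(2000):
    x = (rng.random(), rng.random())
    y = (x[0] + rng.uniform(-0.05, 0.05), x[1] + rng.uniform(-0.05, 0.05))
    gap = math.dist(x, y)
    if gap > 0:
        change = abs(isotropic_gaussian_pdf(x, mu, s) - isotropic_gaussian_pdf(y, mu, s))
        worst_ratio = max(worst_ratio, change / gap)
```

Every sampled difference quotient stays below `L`, in contrast to the unsmoothed count, whose change under an arbitrarily small shift can be a full unit.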

Example 3.4.

Returning to a simplified version of the example used to show that the Betti sequence was unstable, we will now show that Theorem 3.3 is satisfied for this example.

Let be the persistence diagram that contains two off-diagonal points: and (see Figure 4) for any .

Figure 4. Left, the persistence diagram with two persistence points and . The dashed blue lines outline the region used in the definition of the Betti sequence and the red dashed lines outline the region . Right, its corresponding persistence barcode.

The stable Betti sequence for the persistence diagram , using maximum filtration 1 and two subintervals, is easily computed as the vector where

Now let be the persistence diagram, pictured in Figure 5, that contains two off-diagonal points: and , for any .

Figure 5. Left, the persistence diagram with two persistence points and . The dashed blue lines outline the region used in the definition of the Betti sequence and the red dashed lines outline the region . Right, its corresponding persistence barcode.

The stable Betti sequence for the persistence diagram , using maximum filtration 1 and two subintervals, is also easily computed as the vector where

Thus, we obtain

Note that for all , the absolute value of the first entry in the vector above is less than or equal to that of the second entry in the vector. Thus we obtain

Figure 6. The plot of the ratio of the -norm of the change in the stable Betti sequence to the 1-Wasserstein distance between persistence diagrams and .

Recall, in order to satisfy Theorem 3.3 we need

This means that we need to find a constant such that

Figure 6 contains the graph of the left hand side of the inequality above versus . The graph attains a maximum value of approximately as goes to 0. Thus there exists , say, , so that

and Theorem 3.3 is satisfied.

4. Numerical Example

For the numerical experiments, we first consider four point clouds: a uniform lattice of points with a small uniform random coordinate perturbation of magnitude at most on , a uniformly random distribution on , points drawn from the Sierpinski triangle created by the chaos game [2] and a uniformly random distribution with a square hole of on . A sample point cloud for each type is shown in Figure 7.

Figure 7. Top: a uniform lattice of points with a small uniform random coordinate perturbation of magnitude at most and a uniform random distribution on . Bottom: points drawn from the Sierpinski triangle created by the chaos game and a uniformly random distribution with a square hole of removed on .
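A point cloud of the Sierpinski type can be produced with the chaos game in a few lines of Python. The sketch below is an assumption-laden illustration, not the authors' code: the function name, the triangle vertices, the starting point, the seed, and the number of points are all hypothetical, and no initial iterations are discarded.

```python
import random

def sierpinski_chaos_game(num_points, seed=0):
    """Generate points near the Sierpinski triangle by the chaos game."""
    rng = random.Random(seed)
    # Assumed vertices: an equilateral triangle with unit side length.
    vertices = [(0.0, 0.0), (1.0, 0.0), (0.5, 3 ** 0.5 / 2)]
    x, y = 0.25, 0.25  # arbitrary starting point inside the triangle
    points = []
    for _ in range(num_points):
        # Jump halfway towards a uniformly chosen vertex.
        vx, vy = rng.choice(vertices)
        x, y = (x + vx) / 2.0, (y + vy) / 2.0
        points.append((x, y))
    return points

pts = sierpinski_chaos_game(500)
```

Because each step moves halfway towards a vertex, every generated point stays inside the triangle, and after a few iterations the points lie very close to the fractal itself.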

Figure 8 shows a sample persistence barcode and diagram for the uniform random data (left) and Sierpinski data (right) for both zero-dimensional (red) and one-dimensional homology (blue). We observe the difference in each barcode and diagram between the uniform and Sierpinski data. Here we note that the Sierpinski data is generated with the uniform random sampling within the chaos game.

Figure 8. Persistence barcode and diagram. Top: Uniform data. Bottom: Sierpinski data.

We will show the instability of the original Betti sequence and compare the results with the stable Betti sequence. For the stable Betti sequence, in the absence of an ideal covariance matrix for each Gaussian , we use the adapted Gaussian-smoothing approach as defined below: Consider the following set,

where is a positive constant. Since the instability we described above is induced near the domain boundary, we consider points outside but near the boundary of . For the numerical example, we consider a sharp Gaussian such that all the points in participate in the Betti sequence. Since the instability is more sensitive to the lower indices of the Betti sequence, we choose the free parameter as below (notice that if , it reduces to the original Betti sequence)

Let be and be the cardinality of . Then define a new stable Betti sequence as

is the original Betti sequence with each domain extended by , which can also be viewed as the Betti sequence Gaussian-smoothed with a sharp truncation near the domain boundary. Further, we define the cumulative Betti sequence, of recursively as

Then the normalized cumulative vector is defined as

We first show the instability of the Betti sequence. Consider the interval, , where . We consider a uniform lattice of points with the total number of bins for the Betti sequence, , and the total number of data points, . We rescale the lattice by mapping to while keeping the filtration interval . For this case, the shortest lattice interval becomes the same as the filtration interval when and the corresponding normalized cumulative Betti sequence becomes

because the shortest lattice interval coincides with the domain interval, , in the persistent diagram. However, if , the shortest interval becomes less than and so the corresponding Betti sequence is

Here note that if the birth of all the bars in the barcode remain in the second domain and the death of all the bars still remain in the second domain because the perturbation is chosen small enough. Thus if the perturbation is small enough, the first element of the Betti sequence has the value of or while the same element in the stable Betti sequence remains almost same under the small perturbation . The left figure in Figure 9 shows the first element of the Betti sequence versus , blue for the Betti sequence and red for the stable Betti sequence with the perturbation values of were chosen uniformly. The right figure shows the same plot for the uniform random data whose domain is also . As shown in the figure, the original Betti sequence fluctuated between and as expected in the left figure and it also fluctuates more than the stable Betti sequence in the right figure.

Figure 9. , the first element in the sequence, versus . Blue: Betti sequence. Red: Stable Betti sequence. Left: Perturbed lattice with . Right: Uniform random data.

Now in Figure 10 we show both original Betti sequence (blue) and stable Betti sequence (red) for the four cases shown in Figure 7 with the fixed domain size and .

We use samples for each case and plot all the Betti sequences. The figure shows that the stable Betti sequences yield sharper patterns while maintaining a similar overall vector structure. In addition, within each data set the stable Betti sequences are more homogeneous than the Betti sequences, which is promising with regard to possible machine learning applications.

Figure 10. The Betti sequence (blue) and stable Betti sequence (red) with fixed domain size, . Top: a lattice with a small perturbation and a uniform random data. Bottom: a Sierpinski data and a uniform random with a square hole. Each point sequence was computed with and there were points in each data set.

5. Concluding Remarks

Topological data analysis and its main tool, persistent homology, have recently gained attention in the scientific community and have proven to be highly useful in various applications. Recently, sizable research effort has gone into combining topological data analysis and machine learning. However, the representations of persistent homology, the persistence diagram and barcode, are not suitable in their raw forms for incorporation into a machine learning workflow, and proper feature maps, including vectorization methods, are necessary. In this paper, we considered the Betti sequence as a vectorization and showed by example its instability with respect to the 1-Wasserstein distance. In addition, we proposed a stable Betti sequence and proved its stability. With numerical examples, we devised a cumulative stable Betti sequence and showed that the stable Betti sequence was able to achieve a faithful representation of the Betti sequence, in that it distinguishes data sets better when a smaller number of filtration intervals is used. Our future research will incorporate the proposed stable Betti sequence into machine learning algorithms to study its effectiveness in various applications.

Acknowledgments

MJ was funded, in part, by the Doctoral Dissertation Fellowship of the Department of Mathematics at the University at Buffalo. JHJ has been supported by Samsung Science & Technology Foundation under grant number SSTF-BA1802-02.

References

  • [1] H. Adams, T. Emerson, M. Kirby, R. Neville, C. Peterson, P. Shipman, S. Chepushtanova, E. Hanson, F. Motta, and L. Ziegelmeier (2017) Persistence images: a stable vector representation of persistent homology. Journal of Machine Learning Research 18 (8), pp. 1–35.
  • [2] M. Barnsley (1988) Fractals everywhere. Academic Press Inc., Boston, MA.
  • [3] P. Bendich, J. S. Marron, E. Miller, A. Pieloch, and S. Skwerer (2016) Persistent homology analysis of brain artery trees. Ann. Appl. Stat. 10 (1), pp. 198–218.
  • [4] G. Carlsson (2009) Topology and data. Bulletin of the American Mathematical Society 46, pp. 255–308.
  • [5] M. Carrière, S. Y. Oudot, and M. Ovsjanikov (2015) Stable topological signatures for points on 3d shapes. Computer Graphics Forum 34 (5), pp. 1–12. https://onlinelibrary.wiley.com/doi/pdf/10.1111/cgf.12692
  • [6] Y. Chung and A. Lawson (2020) Persistence curves: a canonical framework for summarizing persistence diagrams. arXiv preprint arXiv:1904.07768.
  • [7] A. Cole and G. Shiu (2017) Persistent homology and non-gaussianity. arXiv preprint arXiv:1712.08159.
  • [8] Edelsbrunner, Letscher, and Zomorodian (2002) Topological persistence and simplification. Discrete & Computational Geometry 28 (4), pp. 511–533. ISSN 1432-0444.
  • [9] S. Heydenreich, B. Brück, and J. Harnois-Déraps (2020) Persistent homology in cosmic shear: constraining parameters with topological data analysis. arXiv preprint arXiv:2007.13724.
  • [10] M. Johnson and J. Jung (2020) The interconnectivity vector: a finite-dimensional representation of persistent homology. arXiv preprint arXiv:2011.11579.
  • [11] M. R. McGuirl, A. Volkening, and B. Sandstede (2020) Topological data analysis of zebrafish patterns. Proceedings of the National Academy of Sciences 117 (10), pp. 5113–5124. ISSN 0027-8424. https://www.pnas.org/content/117/10/5113.full.pdf
  • [12] J. Nicponski and J. Jung (2020) Topological data analysis of vascular disease: I a theoretical framework. Frontiers in Applied Mathematics and Statistics 6:34.
  • [13] A. E. Sizemore, J. E. Phillips-Cremins, R. Ghrist, and D. S. Bassett (2019) The importance of the whole: topological data analysis for the network neuroscientist. Network Neuroscience 3 (3), pp. 656–673. https://doi.org/10.1162/netn_a_00073
  • [14] Y. Umeda (2017) Time series classification via topological data analysis. Information and Media Technologies 12, pp. 228–239.
  • [15] L. N. Vaserstein (1969) Markov processes over denumerable products of spaces, describing large systems of automata. Problems Inform. Transmission 5 (3), pp. 47–52.
  • [16] X. Xu, J. Cisewski-Kehe, S.B. Green, and D. Nagai (2019) Finding cosmic voids and filament loops using topological data analysis. Astronomy and Computing 27, pp. 34–52. ISSN 2213-1337.
  • [17] B. Zieliński, M. Lipiński, M. Juda, M. Zeppelzauer, and P. Dłotko (2020) Persistence codebooks for topological data analysis. Artificial Intelligence Review. ISSN 1573-7462.