Topological data analysis (TDA) is a rising field that is useful for analyzing the structure of high-dimensional data. The most popular tool in TDA is persistent homology, introduced by Edelsbrunner et al. in 2002 [8], in which snapshots of the topological structure of the data set are taken at many different scales and the results are compared from one scale to the next. Singular homology is, in general, hard to compute, and almost impossible to compute for real-world data. However, when the given topological space is approximated from finitely many points, simplicial homology can be used instead, as it is easily computable. Persistent homology is obtained by computing simplicial homology at different scales. Previous applications of TDA and persistent homology include 3D shape segmentation [5], astrophysics [7, 9, 16], biology and medicine [11, 12], and neuroscience [3, 13], to name a few.
Popular representations of persistent homology include persistence diagrams and persistence barcodes. The two are mathematically equivalent, and both show how the homological structures of the given data change with scale. Although useful for data analysis, they are not directly compatible with typical machine learning workflows in their raw forms, since diagrams are multisets and barcodes are collections of intervals. One way to reconcile these representations with machine learning algorithms is to vectorize the persistence diagrams or barcodes. There are several vectorization methods, including the topological vector, the persistence vector, and persistence images. These vectorization methods are, in general, more computationally efficient than kernel methods. The topological vector and the persistence vector are straightforward to construct and easy to implement. The persistence image is less straightforward and slower to compute but, in general, delivers better classification results. Another vectorization that we consider in this paper is the Betti sequence [14], which contains the Betti numbers of the homology groups of the simplicial complex built on the data set at all scales of the persistent homology.
In this paper, we show by example that the Betti sequence is unstable with respect to the 1-Wasserstein distance. In other words, a small change in a persistence diagram, measured in the 1-Wasserstein distance, can lead to a large change in the Betti sequence. To our knowledge, the instability of the Betti sequence, although mentioned previously, has not yet been shown explicitly. In practice, such a large change may not be significant if finer filtration intervals are chosen, but to remedy the instability we propose a stabilized version of the Betti sequence and prove its stability. For the stabilization, we adopt a Gaussian-smoothing approach similar to that in [10, 17]. We also present numerical examples that support our claims and validate the proposed stabilization.
This paper is organized as follows. Section 2 covers the background of persistent homology and its representations, together with the definition and instability of the Betti sequence. Section 3 introduces our proposed stabilization of the Betti sequence and proves its stability. Section 4 demonstrates the normalized cumulative Betti sequence on various data sets. Section 5 contains our concluding remarks.
2. Persistent Homology and Betti Sequence
The main persistent homology representations that we consider in this paper are the persistence diagram and the persistence barcode. The two are equivalent, and each can be recovered from the other, although they lend themselves to different numerical manipulations. Let $X$ be a given topological space. Singular homology describes the homological structure of $X$ with the $k$-dimensional homology group $H_k(X)$. The homology group $H_k(X)$ is defined by $H_k(X) = \ker \partial_k / \operatorname{im} \partial_{k+1}$, which is the quotient of the kernel group $\ker \partial_k$ by the image group $\operatorname{im} \partial_{k+1}$ of the boundary map $\partial_k : C_k(X) \to C_{k-1}(X)$, where $C_k(X)$ is the free abelian group whose basis is the set of singular $k$-simplices in $X$. Roughly speaking, the rank of the $k$-dimensional homology group $H_k(X)$, called the $k$th Betti number, indicates how many $k$-dimensional holes there are in $X$. This information is useful in understanding the topological structure of $X$. However, computing $H_k(X)$ is not easy, as $X$ is in general an arbitrary topological space. In fact, it is not practical to use $H_k(X)$ for $X$ arising from real-world applications. Thus, in order to use homological features of $X$ for data analysis, instead of using $X$ directly, we use a point cloud sampled from $X$.
We use typical building algorithms to obtain the point cloud approximation of $X$, known as the simplicial complex $K$. There are various ways of constructing $K$, including the Vietoris-Rips complex $K_\tau$, where the non-negative real number $\tau$ is known as the filtration parameter. For a given value of $\tau$, $K_\tau$ is constructed by gluing simplices whose vertices are pairwise within distance $\tau$. However, it is not known which value of $\tau$ approximates $X$ best. For this reason, we construct $K_\tau$ for various $\tau$, which gives us the notion of persistence. The $k$th homology $H_k(K_\tau)$ corresponding to $K_\tau$ can be defined similarly as above. Let $\beta_k$ be the $k$th Betti number for $K_\tau$. Then $\beta_0$ represents the number of connected components in $K_\tau$, and $\beta_k$ the number of $k$-dimensional cycles or holes. As we have the natural inclusion $K_\tau \hookrightarrow K_{\tau'}$ for $\tau \le \tau'$, and a homomorphism $H_k(K_\tau) \to H_k(K_{\tau'})$, we have the relation between $\beta_k$ and $\tau$, which generates the graph of the persistence barcode in the considered dimension $k$. On the persistence barcode, an interval of filtration values corresponding to the same homology generator is known as a bar and indicates the $k$-dimensional hole structure of $K_\tau$. If we call the starting point of each bar the birth $b$ and the ending point the death $d$, we can create the persistence diagram multiset by considering all points of the form $(b, d)$. Vectorizations of persistence diagrams and persistence barcodes and their stability are key aspects that we consider in this paper.
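As a small illustration of how the zeroth Betti number depends on the filtration parameter, the following sketch counts the connected components of a Vietoris-Rips complex with a union-find structure. The function name `betti0_rips` and the sample points are hypothetical, and the convention that an edge appears when two points are within distance `tau` (rather than `2 * tau`) is an assumption.

```python
from itertools import combinations
from math import dist


def betti0_rips(points, tau):
    """Count connected components of the Vietoris-Rips complex at scale tau.

    Assumption: two points are joined by an edge when their Euclidean
    distance is at most tau. Only the edges matter for components.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative, halving paths on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= tau:
            parent[find(i)] = find(j)  # merge the two components

    return len({find(i) for i in range(len(points))})


# Hypothetical point cloud: three nearby points and one far away.
cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
for tau in (0.5, 1.0, 10.0):
    print(tau, betti0_rips(cloud, tau))
```

As `tau` grows, components merge, and the scales at which they merge are the death values of the zero-dimensional bars.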
2.1. Definition of the Betti Sequence
As explained above, we now consider $K_\tau$ instead of $X$ by sampling finitely many distinct points from $X$. Recall that the Betti number $\beta_k$ is the rank of the $k$-dimensional homology group $H_k(K_\tau)$. More specifically, $\beta_0$ represents the number of connected components in $K_\tau$, $\beta_1$ the number of one-dimensional holes, etc. As we have the natural inclusion of our filtered simplicial complex, namely $K_\tau \hookrightarrow K_{\tau'}$ for $\tau \le \tau'$, we have a homomorphism $H_k(K_\tau) \to H_k(K_{\tau'})$. This defines a relationship between $\beta_k$ and $\tau$, which is used to generate the persistence barcode of dimension $k$.
The Betti sequence, or Betti curve, originally defined in 2017, is the vectorization of the Betti numbers obtained in persistent homology [14]. The Betti sequence uses the persistence barcode and a discretization of the filtration interval to define the vectorization. At each value of $\tau$ in the discretization, we count the number of generators existing at that filtration value, and that count is our vector entry. We provide the formal definition of the Betti sequence below.
Given a persistence barcode of dimension $k$ with finitely many persistence intervals and a maximum filtration $\tau_{\max}$, let $\tau_1, \dots, \tau_n$ be equally spaced points in $[0, \tau_{\max}]$. Let the Betti sequence be the vector of length $n$ whose $i$th entry counts the number of persistence intervals in the barcode existing at the filtration value $\tau_i$.
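The definition above can be sketched in a few lines of Python. The name `betti_sequence_sampled`, the half-open convention that a bar exists on `[birth, death)`, the inclusion of both endpoints of the filtration interval among the sample points, and the example barcode are all assumptions made for illustration.

```python
def betti_sequence_sampled(bars, tau_max, n):
    """Count the bars alive at n equally spaced filtration values in [0, tau_max].

    bars is a list of (birth, death) pairs. Assumption: a bar is alive at
    tau when birth <= tau < death, and the sample points include 0 and tau_max.
    """
    if n < 2:
        raise ValueError("need at least two sample points")
    taus = [tau_max * i / (n - 1) for i in range(n)]
    return [sum(1 for b, d in bars if b <= t < d) for t in taus]


# Hypothetical barcode: two long bars and one short bar.
bars = [(0.0, 0.6), (0.1, 0.2), (0.3, 1.0)]
print(betti_sequence_sampled(bars, tau_max=1.0, n=5))  # [1, 1, 2, 1, 0]
```

In this example the short bar `(0.1, 0.2)` lies strictly between the sample points 0 and 0.25, so it contributes to no entry.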
This definition suffers from the following flaw: persistence bars that fall entirely between two consecutive discretization values are not counted at all and have no impact on the Betti sequence. If the mesh size is exceedingly small, then any bar that falls entirely between consecutive values of $\tau$ is likely due to noise, and its omission might not be an issue. However, if the mesh size is chosen poorly, this could result in a major loss of information.
We propose an alternate definition of the Betti sequence which agrees with the original definition in the limit as the mesh size goes to zero.
Given a persistence barcode of dimension $k$ with finitely many bars and a maximum filtration $\tau_{\max}$, divide the interval $[0, \tau_{\max}]$ into $n$ equal subintervals of length $\tau_{\max}/n$. Let the Betti sequence be the vector of length $n$ whose $i$th entry counts the number of bars in the barcode that exist for at least one filtration value in the $i$th subinterval of the filtration interval $[0, \tau_{\max}]$.
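A minimal sketch of this alternate definition follows. The name `betti_sequence_binned`, the use of closed subintervals (so a bar touching a boundary counts in both neighbours), and the example barcode are assumptions.

```python
def betti_sequence_binned(bars, tau_max, n):
    """Count, for each of n equal subintervals of [0, tau_max], the bars meeting it.

    Assumption: subintervals are closed, so a bar (birth, death) meets
    [lo, hi] exactly when birth <= hi and death >= lo.
    """
    width = tau_max / n
    sequence = []
    for i in range(n):
        lo, hi = i * width, (i + 1) * width
        sequence.append(sum(1 for b, d in bars if b <= hi and d >= lo))
    return sequence


# Hypothetical barcode: two long bars and one short bar.
bars = [(0.0, 0.6), (0.1, 0.2), (0.3, 1.0)]
print(betti_sequence_binned(bars, tau_max=1.0, n=4))  # [2, 2, 2, 1]
```

Here the short bar `(0.1, 0.2)` is counted in the first subinterval even though it contains none of the subinterval endpoints.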
This alternate definition has the added benefit of being easily translated into the language of persistence diagrams, which makes it possible to study the stability of the Betti sequence with respect to the $p$-Wasserstein metric.
2.2. Redefining the Betti Sequence via the Persistence Diagram
In order to discuss stability with respect to the Wasserstein metric, we need to redefine the Betti sequence in terms of the persistence diagram. Consider a persistence diagram with finitely many off-diagonal points. Then a bar on the corresponding persistence barcode exists at some filtration value in the $i$th subinterval if and only if its birth-death point on the persistence diagram falls in the shaded region, call it $R_i$, illustrated in Figures 1 and 2.
More precisely, a bar on a barcode exists at some filtration value in the subinterval if and only if its birth-death point on the persistence diagram falls in one of the following four subregions of :
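If the birth-death point lies in the red region, then the bar begins after the left endpoint of the subinterval and ends after its right endpoint.
If the birth-death point lies in the orange region, then the bar begins before the left endpoint of the subinterval and ends after its right endpoint.
If the birth-death point lies in the green region, then the bar begins after the left endpoint of the subinterval and ends before its right endpoint.
If the birth-death point lies in the blue region, then the bar begins before the left endpoint of the subinterval and ends before its right endpoint.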
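The four cases can be checked with a short classifier. This is only a sketch: the function `region` is hypothetical, it assumes that each case compares the birth with the left endpoint `lo` and the death with the right endpoint `hi` of the subinterval, and it assigns ties on a boundary to the "before" side.

```python
def region(birth, death, lo, hi):
    """Classify a birth-death point relative to the subinterval [lo, hi].

    Returns the colour of the subregion, or None when the bar does not
    exist at any filtration value in [lo, hi].
    """
    if birth > hi or death < lo:
        return None  # the bar misses the subinterval entirely
    begins_after_left = birth > lo
    ends_after_right = death > hi
    if begins_after_left and ends_after_right:
        return "red"
    if ends_after_right:
        return "orange"
    if begins_after_left:
        return "green"
    return "blue"


# Hypothetical subinterval [0.25, 0.5] and sample bars.
for bar in [(0.3, 0.9), (0.1, 0.9), (0.3, 0.4), (0.1, 0.3), (0.6, 0.9)]:
    print(bar, region(*bar, lo=0.25, hi=0.5))
```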
Let us now make a precise description of the as a subset of (with multiplicity). Let where
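Let $D$ be a persistence diagram with finitely many off-diagonal points. The Betti sequence of $D$ is defined as
$$\beta(D) = \left( |R_1 \cap D|, |R_2 \cap D|, \dots, |R_n \cap D| \right),$$
where $|R_i \cap D|$ is the cardinality of the intersection of the region $R_i$ of the $i$th subinterval, described above, and the persistence diagram $D$.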
2.3. Instability of the Betti Sequence
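Recall the definition of the $p$-Wasserstein distance between persistence diagrams [15].
The $p$-Wasserstein distance between persistence diagrams $D$ and $D'$ is
$$W_p(D, D') = \inf_{\gamma} \left( \sum_{(x, y) \in \gamma} \| x - y \|_\infty^p \right)^{1/p},$$
where $\gamma$ is a partial matching of $D$ and $D'$, with unmatched points paired with the diagonal. Note that as $p \to \infty$ this distance becomes the bottleneck distance.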
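For very small diagrams the distance can be computed by brute force, which may help when checking examples by hand. The sketch below is not an efficient or authoritative implementation: the function `wasserstein_distance` is hypothetical, it assumes the $L^\infty$ ground norm, and it handles unmatched points by the standard device of adding the diagonal projections of each diagram to the other.

```python
from itertools import permutations


def wasserstein_distance(diagram_a, diagram_b, p=1):
    """Brute-force p-Wasserstein distance between two small persistence diagrams.

    Assumptions: points are (birth, death) pairs, the ground norm is the
    sup norm, and unmatched points are sent to the diagonal. The cost is
    factorial in the number of points, so use it only on tiny diagrams.
    """
    def sup_norm(u, v):
        return max(abs(u[0] - v[0]), abs(u[1] - v[1]))

    def projection(u):
        mid = (u[0] + u[1]) / 2
        return (mid, mid)

    n_a, n_b = len(diagram_a), len(diagram_b)
    # Augment each diagram with the diagonal projections of the other.
    left = list(diagram_a) + [projection(q) for q in diagram_b]
    right = list(diagram_b) + [projection(q) for q in diagram_a]

    def cost(i, j):
        if i >= n_a and j >= n_b:
            return 0.0  # diagonal matched to diagonal costs nothing
        return sup_norm(left[i], right[j]) ** p

    best = min(
        sum(cost(i, j) for i, j in enumerate(perm))
        for perm in permutations(range(len(right)))
    )
    return best ** (1 / p)


# Moving one point by 0.1 gives distance 0.1 (up to rounding).
print(wasserstein_distance([(0.2, 0.8)], [(0.2, 0.9)]))
```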
Let and be persistence diagrams containing only finitely many off-diagonal points. The Betti sequence is not stable with respect to the 1-Wasserstein distance. That is, there exists persistence diagrams and such that there does not exist a non-negative constant such that
We will prove by example. Let be a persistence diagram with finitely many off-diagonal points and with maximum filtration . Suppose further, for diagram , that the number of subintervals, , of is fixed so that there exists exactly one birth-death point with in each of the non-overlapping parts of the regions , described above, for as seen in Figure 3. Fix an index and for any let . The Betti sequence vector, , of this persistence diagram is then, by definition, given by the following
Now consider another persistence diagram with finitely many off-diagonal points with the same maximum filtration, and the same number of intervals, . Suppose is almost an exact copy of except that has been shifted by to become . Then the persistence point is in and the Betti sequence for is
where , and . For the Betti sequence vectorization to be stable with respect to the 1-Wasserstein distance under the small perturbation of we need a non-negative constant such that
Clearly, and if we recall the definition of the 1-Wasserstein distance
where is a partial matching of and , we know that . Thus for stability we need a non-negative constant such that . However, as can be made arbitrarily small, there does not exist such a constant . Therefore the Betti sequence is unstable with respect to the 1-Wasserstein distance. ∎
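The mechanism behind this argument can be reproduced numerically on a toy diagram. The sketch below is an illustration under assumptions, not the example of the proof: it uses a single hypothetical point whose death value crosses a subinterval boundary, closed subintervals, and the 1-norm of the difference of the two Betti sequences. Since the two diagrams differ by moving one point a distance `eps`, their 1-Wasserstein distance is at most `eps`.

```python
def betti_sequence_binned(diagram, tau_max, n):
    """Entry i counts the points whose bar meets the i-th closed subinterval."""
    width = tau_max / n
    return [
        sum(1 for b, d in diagram if b <= (i + 1) * width and d >= i * width)
        for i in range(n)
    ]


def jump(eps, tau_max=1.0, n=2):
    """Move one death value across a subinterval boundary by a total of eps.

    Returns the 1-norm change of the Betti sequence and its ratio to eps.
    """
    boundary = tau_max / n
    before = betti_sequence_binned([(0.1, boundary - eps / 2)], tau_max, n)
    after = betti_sequence_binned([(0.1, boundary + eps / 2)], tau_max, n)
    change = sum(abs(x - y) for x, y in zip(before, after))
    return change, change / eps


for eps in (1e-1, 1e-3, 1e-6):
    print(eps, jump(eps))
```

The change in the Betti sequence stays equal to 1 while `eps` shrinks, so the ratio grows without bound and no fixed constant can dominate it.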
Similarly, it can be shown that the Betti sequence is unstable with respect to the Wasserstein distance with $p = \infty$, i.e., with respect to the bottleneck distance.
3. Stabilized Betti Sequence
Suppose is a persistence diagram with off-diagonal points and maximum filtration . Divide the interval into equal subintervals of the form each of length as above. Let
be a collection of Gaussian distributions where, the mean of , is chosen to be and where is the covariance matrix for . Define to be and note that each and . Then the stable Betti Sequence vector is defined by
is the probability density function of the Gaussian at . Note that is the determinant of the covariance matrix, .
The choice of is still an open question. Ideally, we would want to use a sharp Gaussian and so should be defined so that is essentially the same for every persistence point in not “near” the boundary of . The choice of should be further studied in future work.
Let be a persistence diagram of finite size and be the persistence diagram obtained by perturbing by an arbitrary such that . Then there exists a non-negative constant for any such that
Let be the number of off-diagonal points in . Let be the partial matching that realizes the 1-Wasserstein distance between and . For a fixed we have
As is continuously differentiable it is also Lipschitz continuous with Lipschitz constant . We get
If we let we have the desired result. ∎
Returning to a simplified version of the example used to show that the Betti sequence was unstable, we will now show that Theorem 3.3 is satisfied for this example.
Let be the persistence diagram that contains two off-diagonal points: and (see Figure 4) for any .
The stable Betti sequence for the persistence diagram , using maximum filtration 1 and two subintervals, is easily computed as the vector where
Now let be the persistence diagram, pictured in Figure 5, that contains two off-diagonal points: and , for any .
The stable Betti sequence for the persistence diagram , using maximum filtration 1 and two subintervals, is also easily computed as the vector where
Thus, we obtain
Note that for all , the absolute value of the first entry in the vector above is less than or equal to that of the second entry in the vector. Thus we obtain
Recall, in order to satisfy Theorem 3.3 we need
This means that we need to find a constant such that
Figure 6 contains the graph of the left hand side of the inequality above versus . The graph attains a maximum value of approximately as goes to 0. Thus there exists , say, , so that
and Theorem 3.3 is satisfied.
4. Numerical Examples
For the numerical experiments, we first consider four point clouds: a uniform lattice of points with a small uniform random coordinate perturbation of magnitude at most on , a uniformly random distribution on , points drawn from the Sierpinski triangle created by the chaos game  and a uniformly random distribution with a square hole of on . A sample point cloud for each type is shown in Figure 7.
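The four kinds of point clouds can be generated along the following lines. This is a sketch only: the function names, the unit square as the domain, the sample sizes, the perturbation magnitude, and the position of the square hole are hypothetical choices, since the exact values are not restated here.

```python
import random


def perturbed_lattice(m, delta, rng):
    """m x m lattice on the unit square, each coordinate jittered by at most delta."""
    return [
        (i / (m - 1) + rng.uniform(-delta, delta),
         j / (m - 1) + rng.uniform(-delta, delta))
        for i in range(m)
        for j in range(m)
    ]


def uniform_square(n, rng):
    """n points drawn uniformly at random from the unit square."""
    return [(rng.random(), rng.random()) for _ in range(n)]


def sierpinski(n, rng):
    """n points of the Sierpinski triangle produced by the chaos game."""
    vertices = [(0.0, 0.0), (1.0, 0.0), (0.5, 3 ** 0.5 / 2)]
    x, y = vertices[0]  # start on the attractor, so no burn-in is needed
    points = []
    for _ in range(n):
        vx, vy = rng.choice(vertices)
        x, y = (x + vx) / 2, (y + vy) / 2  # move halfway to a random vertex
        points.append((x, y))
    return points


def uniform_with_hole(n, rng, lo=0.25, hi=0.75):
    """n uniform points on the unit square, rejecting those in the open square hole."""
    points = []
    while len(points) < n:
        x, y = rng.random(), rng.random()
        if not (lo < x < hi and lo < y < hi):
            points.append((x, y))
    return points


rng = random.Random(0)  # fixed seed so the clouds are reproducible
clouds = {
    "lattice": perturbed_lattice(10, 0.01, rng),
    "uniform": uniform_square(100, rng),
    "sierpinski": sierpinski(100, rng),
    "hole": uniform_with_hole(100, rng),
}
print({name: len(points) for name, points in clouds.items()})
```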
Figure 8 shows a sample persistence barcode and diagram for the uniform random data (left) and Sierpinski data (right) for both zero-dimensional (red) and one-dimensional homology (blue). We observe the difference in each barcode and diagram between the uniform and Sierpinski data. Here we note that the Sierpinski data is generated with the uniform random sampling within the chaos game.
We will show the instability of the original Betti sequence and compare the results with the stable Betti sequence. For the stable Betti sequence, in the absence of an ideal covariance matrix for each Gaussian , we use the adapted Gaussian-smoothing approach as defined below: Consider the following set,
where is a positive constant. Since the instability we described above is induced near the domain boundary, we consider points outside but near the boundary of . For the numerical example, we consider a sharp Gaussian such that all the points in participate in the Betti sequence. Since the instability is more sensitive to the lower indices of the Betti sequence, we choose the free parameter as below (notice that if , it reduces to the original Betti sequence)
Let be and be the cardinality of . Then define a new stable Betti sequence as
is the original Betti sequence with each domain extended by , which can also be viewed as the Betti sequence Gaussian-smoothed with a sharp truncation near the domain boundary. Further, we define the cumulative Betti sequence, of recursively as
Then the normalized cumulative vector is defined as
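The construction in this passage can be sketched as follows, with several loudly stated assumptions: the functions `extended_betti_sequence`, `cumulative`, and `normalized` are hypothetical, a single constant `eps` is used for every subinterval, the cumulative sequence is taken to be the running sum, and the normalization divides by the largest cumulative entry. The exact choices of the free parameter and of the normalization in the text may differ.

```python
def extended_betti_sequence(diagram, tau_max, n, eps):
    """Count the points whose bar meets each subinterval widened by eps on both sides.

    With eps = 0 this is the plain count over closed subintervals.
    """
    width = tau_max / n
    sequence = []
    for i in range(n):
        lo, hi = i * width - eps, (i + 1) * width + eps
        sequence.append(sum(1 for b, d in diagram if b <= hi and d >= lo))
    return sequence


def cumulative(sequence):
    """Running sum: each entry adds the current value to the previous total."""
    totals, running = [], 0
    for value in sequence:
        running += value
        totals.append(running)
    return totals


def normalized(sequence):
    """Divide by the largest entry (assumed normalization); zeros stay zeros."""
    top = max(sequence, default=0)
    return [value / top for value in sequence] if top else list(sequence)


# Hypothetical diagram with three points.
diagram = [(0.0, 0.6), (0.1, 0.2), (0.3, 1.0)]
plain = extended_betti_sequence(diagram, 1.0, 4, eps=0.0)
widened = extended_betti_sequence(diagram, 1.0, 4, eps=0.06)
print(plain, widened)
print(normalized(cumulative(plain)))
```

In this example the widened regions pick up points lying just outside a subinterval boundary, which is the effect the extension is meant to have.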
We first show the instability of the Betti sequence. Consider the interval, , where . We consider a uniform lattice of points with the total number of bins for the Betti sequence, , and the total number of data points, . We rescale the lattice by mapping to while keeping the filtration interval . For this case, the shortest lattice interval becomes the same as the filtration interval when and the corresponding normalized cumulative Betti sequence becomes
because the shortest lattice interval coincides with the domain interval, , in the persistent diagram. However, if , the shortest interval becomes less than and so the corresponding Betti sequence is
Here note that if the birth of all the bars in the barcode remain in the second domain and the death of all the bars still remain in the second domain because the perturbation is chosen small enough. Thus if the perturbation is small enough, the first element of the Betti sequence has the value of or while the same element in the stable Betti sequence remains almost same under the small perturbation . The left figure in Figure 9 shows the first element of the Betti sequence versus , blue for the Betti sequence and red for the stable Betti sequence with the perturbation – values of were chosen uniformly. The right figure shows the same plot for the uniform random data whose domain is also . As shown in the figure, the original Betti sequence fluctuated between and as expected in the left figure and it also fluctuates more than the stable Betti sequence in the right figure.
We use samples for each case and plot all the Betti sequences. The figure shows that the stable Betti sequences yield sharper patterns while maintaining a similar overall vector structure. In addition, within each data set the stable Betti sequences are more homogeneous than the Betti sequences, which is promising for possible machine learning applications.
5. Concluding Remarks
Topological data analysis and its main tool, persistent homology, have recently gained attention in the scientific community and have proven to be highly useful in various applications. Recently, a sizable body of research has sought to combine topological data analysis and machine learning. However, the representations of persistent homology, the persistence diagram and barcode, are not suitable in their raw forms for incorporation into a machine learning workflow, and proper feature maps, including vectorization methods, are necessary. In this paper, we considered the Betti sequence as a vectorization and showed by example its instability with respect to the 1-Wasserstein distance. In addition, we proposed a stable Betti sequence and proved its stability. With numerical examples, we devised a cumulative stable Betti sequence and showed that the stable Betti sequence achieves a faithful representation of the Betti sequence, in that it distinguishes data sets better when a smaller number of filtration intervals is used. Our future research will incorporate the proposed stable Betti sequence into machine learning algorithms to study its effectiveness in various applications.
MJ was funded, in part, by the Doctoral Dissertation Fellowship of the Department of Mathematics at the University at Buffalo. JHJ has been supported by Samsung Science & Technology Foundation under grant number SSTF-BA1802-02.
- [1] (2017) Persistence images: a stable vector representation of persistent homology. Journal of Machine Learning Research 18 (8), pp. 1–35.
- [2] (1988) Fractals everywhere. Academic Press Inc., Boston, MA.
- [3] (2016-03) Persistent homology analysis of brain artery trees. Ann. Appl. Stat. 10 (1), pp. 198–218.
- [4] (2009-04) Topology and data. Bulletin of The American Mathematical Society 46, pp. 255–308.
- [5] (2015) Stable topological signatures for points on 3d shapes. Computer Graphics Forum 34 (5), pp. 1–12.
- [6] (2020) Persistence curves: a canonical framework for summarizing persistence diagrams.
- [7] (2017) Persistent homology and non-gaussianity. arXiv preprint arXiv:1712.08159.
- [8] (2002-11-01) Topological persistence and simplification. Discrete & Computational Geometry 28 (4), pp. 511–533.
- [9] (2020) Persistent homology in cosmic shear: constraining parameters with topological data analysis. arXiv preprint arXiv:2007.13724.
- [10] (2020) The interconnectivity vector: a finite-dimensional representation of persistent homology. arXiv preprint arXiv:2011.11579.
- [11] (2020) Topological data analysis of zebrafish patterns. Proceedings of the National Academy of Sciences 117 (10), pp. 5113–5124.
- [12] (2020) Topological data analysis of vascular disease: i a theoretical framework. Frontiers in Applied Mathematics and Statistics 6:34.
- [13] (2019) The importance of the whole: topological data analysis for the network neuroscientist. Network Neuroscience 3 (3), pp. 656–673.
- [14] (2017) Time series classification via topological data analysis. Information and Media Technologies 12, pp. 228–239.
- [15] (1969) Markov processes over denumerable products of spaces, describing large systems of automata. Problems Inform. Transmission 5 (3), pp. 47–52.
- [16] (2019) Finding cosmic voids and filament loops using topological data analysis. Astronomy and Computing 27, pp. 34–52.
- [17] (2020-09-01) Persistence codebooks for topological data analysis. Artificial Intelligence Review.