In topological data analysis a common goal is to extract topological information from a finite set of sample points with a symmetric distance function . Often one assumes that is a metric; that is, that for each , and that the triangle inequality holds. While the sample points can typically be considered as elements of some , we will instead consider as a weighted graph, or a finite metric space (if happens to be a metric).
Naively, a finite topological space would just be a finite set of disconnected points. A widely used tool to add topological structure to such a weighted graph is a family of abstract simplicial complexes called Rips-filtration. For every it is given by , and has the property that for . This complex is, by construction, a flag complex; that is, each simplex is included in if and only if every edge is included in . Alternatively, we can consider it as the clique-complex of the graph with all edges with weight less than , where edges with weight are missing. In an abuse of notation, we will not distinguish between finite flag-complexes and graphs, nor between filtered flag-complexes and weighted graphs.
Note that we could have permitted equality in the definition of instead; which intervals contain which of their endpoints is an endless source of subtleties that are mostly irrelevant in practice. We encourage readers to ignore these subtleties, but will strive to get them right.
In order to obtain information about the topology of the underlying metric space, one studies the persistence module
. This is a finite collection of finite-dimensional vector spacesover a field with commuting linear maps for , over a totally ordered set. In particular, we consider the homology vector spaces that partially describe the topology of at scale and dimension . Since for , the embeddings induce linear maps . An element in the range corresponds to a topological feature that persists (at least) from scale to . It is intuitively obvious that long-lived features (), are more interesting for understanding the structure of than short-lived features . Even though the collection is formally infinite, it is “finite in practice”, as for finite weighted graphs it decomposes into a finite number of intervals with identical complexes in each interval. Most of our theory will concern such “practically finite” collections of finite-dimensional vector spaces . For a survey of persistent homology see [Car09] and the references therein.
The computation of the persistence diagram typically scales badly with the number of sample points: The distance matrix will have entries, each of which needs to be computed and potentially stored in memory. The number of simplices of dimension scales like for , and there are simplices in total.
A popular strategy to reduce the number of simplices is subsampling and thresholding: We first do a subsampling step, i.e., we take a subset of sample points that is dense enough in with respect to some parameter , and then consider only the corresponding persistence diagram in the range for some threshold . In order to obtain proper performance, the scale quotient needs to be bounded.
The procedure of subsampling and thresholding allows to resolve certain scales, slightly smaller than , in the persistence diagram relatively precisely, up to a relative error depending on , without needing too high numbers of simplices. It does not allow us to resolve relatively long-lasting features. That is, we will find these features but cannot find out how long they last; we cannot use this approach to find out whether features present at are actually the same. In this work, we address these shortcomings by the construction of a sparsified approximate Rips-complex. Compared to the general pipeline for topological data analysis, our approach looks somewhat like the following:
The groundbreaking work of D. Sheehy [She13] proposes one way of obtaining a sparsified complex for metric spaces (weighted graphs that fulfill the triangle inequality): One constructs a length function that is for most pairs of points, and is therefore sparse (and obviously breaks the triangle inequality). This gives rise to a Rips-complex that is multiplicatively interleaved with the original one. This sparsified complex has, for large finite spaces of bounded doubling dimension , asymptotically only linearly many simplices in (with, just like in subsampling and thresholding, bad scaling in the other parameters). Further, it is shown in [She13] that the sparsified complex can be constructed in asymptotic time (and especially only many evaluations of the distance function). [She13] does not consider the interplay with subsampling.
From this, and the publication date of [She13], one might assume that our work is already done, and sparsified complexes and relative-error approximations are today standard in the field of applied topological data analysis. This is not the case, see e.g.[OPT17]. There exists, to the best of our knowledge, no implementation of [She13], at all. We believe that this is in part due to the construction in [She13] being excessively complicated.
In this paper we will develop and benchmark a practical sparsification scheme. We made great efforts to ensure a construction that is easy to implement in a self-contained way, and easy to understand and modify. On the other hand, it is not optimized for provable asymptotic worst-case complexity, nor for clever time-saving tricks. However, the computational resources spent on sparsification are negligible compared to the computation of persistent homology of the sparsified complex, and many sample sets will be subsampled to a much smaller subset anyway.
We implemented this sparsification scheme in the julia language, and will make it avaiable soon. We believe that our scheme should outperform the proposed schemes [She13] in practice, but we cannot put this to the test, lacking any reference implementation of alternative approaches.
2 Persistence modules and diagrams
The persistence module of the Rips complexes is given by
where consists of the homology vector spaces , and is given by the embeddings for .
The following well-known classification theorem for persistence modules is the foundation of most of the field of topological data analysis:
Let be a sequence of finite-dimensional vector spaces, such that only finitely many , and let be linear maps. We set for , and .
Then there exists a collection of numbers and vectors , for and such that:
for , we have , and
for , we have , and
the collection forms a basis for .
In the notation of the theorem, is called the life-time of a vector ; it is said to persist from birth (exclusive, before) to death (inclusive, after) . A death at formally corresponds to vectors that are never in the kernel (for metric spaces this will be a single vector in , corresponding to the fact that we always have a connected component). Likewise, a birth at formally corresponds to vectors that are present in the very first space; for metric spaces, we get one such vector in for each distinct point.
Persistence modules are (up to changes of coordinates) determined by the ranks for , or, equivalently, by the numbers , as , and .
Our persistence module is parametrized over , and not over ; however, since we only consider finite metric spaces , we can parametrize over the finite set after sorting it. In an abuse of notation, we will do so implicitly.
In the following we will discuss the visualization of persistence modules, what we mean by approximation, and the need of approximation due to computational scaling.
2.1 Persistence diagrams
There are two popular visualizations of persistence modules: First, in bar codes, one collects the intervals and draws them, annotated with the dimension , and possibly a representative (in the quotient space ). If there are many bars stretching over the same interval, one typically draws only a single bar and annotates it with its multiplicity. In order to understand , one then collects all bars containing ; in order to understand , one collects all bars whose interval is a superset of . Note that by construction, a bar is not a superset of .
The second popular visualization is as a persistence diagram. Instead of drawing bars, i.e. intervals , we draw a dot in the -plane. We can read off by collecting all dots lying strictly to the left () and upwards of the point . We call this diagram .
We can merge these two visualizations by drawing for each basis vector; that is, for each dot in the persistence diagram, we draw a straight line to the right until we hit the diagonal; or, alternatively, we stack the bars, drawn horizontally, using the death as -coordinate.
Example (The unit circle).
As a simple example, consider the unit circle with the standard distance and as equispaced sample points. We can describe the persistent homology in terms of bar-codes or as a persistence diagram; for this graphic, we merged both representations, and computed persistent homology up to dimension , c.f. Figure 0(a). From this, we can read off that, e.g. for scales , all embeddings induce isomorphisms on homology, which is given by one dimension in (single connected component), and one dimension in (we roughly have an ).
It is possible to define a notion of persistent homology of Rips complexes for infinite metric spaces, like , and compute such persistence diagrams. We refer to [AA17] for the details; we plotted the persistence diagram of the actual sphere in Figure 1.
2.2 Interleavings and Gromov Hausdorff distance
The notion of approximation of persistence modules is that of interleavings:
Suppose we have given two persistence modules of -vector spaces and , and a nondecreasing function , with . A -interleaving is a collection of linear maps for and for , such that everything commutes (i.e. also with and for ).
We then say that is -interleaved into (equivalently: is interleaved into ).
A way of obtaining an interleaving for Rips complexes is by perturbing the distance function : If we have some such that for all then for all . Applying the homology functor turns this interleaving of simplicial flag complexes with embeddings into an interleaving of vector spaces with linear maps.
As a more interesting example, consider a finite metric space , and an -dense subset , i.e., we have a projection with for all . Let denote the Rips-complex of , and denote the Rips-complex of . We clearly have for all ; however, we cannot have , for any , as an inclusion of simplicial complexes; there simply are not enough points in if .
This can be remedied by “multiplexing points”; that is, by considering . By the triangle inequality, , yielding , for .
Intuitively, and should describe the same space, and hence , which gives an interleaving . The topological basis for this is the following:
Let be flag-complexes, and let be a projection of the underlying point set. We say that this pair is a single step deformation if is simplicial, i.e., whenever (this means that is well-defined) and contiguous to the identity, i.e., for all , i.e., .
One says that and are simple homotopy equivalent, if there exists a sequence of single-step deformations connecting and . If , by a sequence of increasing single step deformations, then we write .
Simple homotopy equivalent complexes are homotopy equivalent, when considered as topological spaces of formal convex combinations: This is because single step deformations are homotopy equivalences, by and with and defined by
Considering the definition of , we see that ; and, since homology is invariant under homotopy equivalence, we really obtain an interleaving in homology.
Using -dense subsets, we naturally obtain an interleaving for two finite metric spaces and that have Gromov-Hausdorff distance bounded by . That is, we assume that we can extend the metric from to , preserving the triangle inequality, such that is dense and is -dense. Then we obtain an interleaving .
The interleaving is even slightly tighter than this by the following observation: If is -interleaved into , modulo simple homotopy equivalence, and , then is also -interleaved into , as is contractible.
Consider the example 2.1 of . The set is -dense. We therefore obtain a -interleaving, from any finite into , modulo simple homotopy. In this sense, the persistence homology of approximates, up to -interleaving, the persistent homology of .
2.3 Approximate persistence diagrams
Now, suppose that is -interleaved into . Consider the corresponding commutative diagrams (1). Given some feature persisting in from to , with , this feature must also persist in from to : Consider the left diagram in (1). Take a preimage , and consider :
Likewise, a feature persisting in from to with must persist in from until . In terms of ranks, we must have and .
In other words: The set of basis features (from the normal form, i.e., the ) persisting in from until form a linearly independent subset of the space of features persisting in from until . If , then the set of basis features persisting in from until span the space of features persisting in from until .
In order to visualize this information in the persistence diagram, we draw for each basis-vector in a “bar-with-errors”, i.e. a rectangle , instead of the dot . A lower bound on can then be read of by counting all rectangles, that are contained in the upper left corner . If , we can read off an upper bound by counting all rectangles that intersect the upper left corner . For an example, see Figure 2.
If we additionally draw the graph of , we can partially match rectangles from the persistence diagram of with dots from the persistence diagram of . This means, we
match dots only to rectangles in which they are contained,
the matching may leave some rectangles with upper right corner below the graph of unmatched, but not others,
and the matching may leave some dots below the graph of unmatched, but no others.
This partial matching can be formalized as the Interleaving Theorem (see Appendix B). Note that the theorem uses a more ’symmetric’ definition of interleavings than Definition 1: Instead of demanding , interleavings are defined with respect to two maps and with and . This more general case can be again reduced to an interleaving in the sense of Definition 1 by defining and , yielding . For the sake of self-containedness, we give a proof of the interleaving theorem in Appendix B; alternatively see the references in [Car09]. The approximate visualization is, to the best of our knowledge, original to this work.
2.4 Non-sparse computational scaling
Let us consider how the computation of a full persistence diagram scales with the number of sample points in a metric space. Usually, algorithms will need to have access to the full distance matrix, to enumerate simplices, and to compute the Smith normal form of the boundary matrix (which has simplices as rows and columns).
Firstly, the distance matrix will have entries, each of which needs to be computed and potentially stored in memory. This precludes any “big data” applications.
Next, we count the number of simplices: We will have many simplices of dimension for , and many simplices in total. Any algorithm that needs to enumerate simplices will become infeasible for computing a full exact persistence diagram, once we have more than 40 sample points. This can be somewhat alleviated by instituting a dimension cut-off at some small , bringing the number of -simplices down from exponential to a polynomial . Many algorithms, however, will actually need to enumerate the -simplices in order to compute th homology.
As the linear algebra of how to compute homology and persistence of simplicial complexes is not the focus of this work, we just mention that this scales worst-case in the third power in the number of simplices, in practice however often only linearly. All homology computations in this paper were done by ripser [Bau17], which is a state of the art software for computing persistent (co-)homology over fields for small primes , given a distance matrix (which does not need to fulfill the triangle inequality).
From these considerations, we can see why the exact computation of persistence diagrams is limited to relatively small data-sets and dimensions. For a survey of current software, see [OPT17].
2.5 Computational scaling of sparsified complexes
As the main bulk of computational scaling is not related to the realm of linear algebra, it is necessary to reduce the number of simplices. All three sparsification strategies mentioned in this work —subsampling and thresholding, [She13], and the strategy introduced in this paper— exhibit a linear asymptotic scaling of the number of simplices. In pratice we generally observe a speed up as a result of the sparsification, although the asymptotic regime of linear scaling is usually only entered for very small dimensions. This drawback is due to exponential scaling of the constants in the desired precision.
Let us first recall subsampling and thresholding. One fixes a lower bound , and chooses a subset of sample points with a projection , such that holds for all . We say that this subsample has density if for all in . One then thresholds the resulting space by fixing a and only considering the persistence diagram in the range . This means edges that have are taken out of the considerations. Let be the diameter of the space. We have seen in Section 2.2 that is then -interleaved into with
It is instructive to rewrite this as a relative error bound: Let be the quotient of scales; the lowest nontrivial relative error is obtained at , for an error of for relatively large .
In order to count simplices, we will need to make some assumptions on the metric space . One standard assumption is the following:
The metric space has bounded intrinsic dimension , i.e., we can place at most distinct points with pairwise distances bounded below by in the ball , for every and .
It is clear that all subsets of have bounded intrinsic dimension, and that this property is invariant under isometries.
If we assume A 1
, we can estimate the number of simplices in: Each edge must have . Assume that ; then, each point can have at most neighbors (in the considered as a graph). Therefore, the number of -simplices is bounded above by
From this simple count, we can infer linear scaling in the number of points, albeit polynomially bad scaling in the relative precision, and exponentially bad scaling in the dimension of simplices and the intrinsic dimension of the dataset.
Note that in subsampling and threshholding the relative precision is only obtained close to the threshhold at . Our own approach is to remedy this shortcoming while retaining a similar number of simplices.
For this we prescribe an arbitrary interleaving . Usually, this will have the form
with absolute precision and relative precision . We can relate to a precision parameter via the formula , or . We then construct a family of approximating subsets , with for , and bounded density , and consider only (a subset of) edges with , where . In order to bound the number of simplices, we attribute each simplex to one of its vertices with minimal . Assuming , the number of simplices attributed to each point is bounded above by , because each relevant neighbor needs to have and hence .
The price we have to pay for this global interleaving is somewhat less flexibility in the choice of subsamples: It is trivial to find a subsample for fixed using evaluations of the metric; we are, however, constrained to contraction trees, that carry some additional structure. This yields , in metric evaluations.
3 Sparse Approximations
Fundamental to the approximation is the organization of the finite metric space as a rooted tree, such that the intricate details are at the lower levels, while the higher levels provide a coarse overview. The structure we use specifically is that of a -contraction tree.
An ordered rooted tree on a finite set is an enumeration, i.e. an ordering, and a parent map, which is a strictly decreasing function . In an abuse of notation, we also consider it as a map , via .
We call the truncation of the tree. The projection onto the truncation is given by going up in the tree until we are in , i.e. by with .
A contraction tree is an ordered rooted tree additionally endowed with nonincreasing contraction times with for . This defines the function , and associated truncations and .
A contraction tree on a metric space must additionally fulfill the inequality , for all and . If we have the stronger estimate , for some nondecreasing function with , then we call it an -contraction tree, and the metric precision function.
Given a contraction tree on and a symmetric weight function , let be the restriction of the Rips-complex to . Hence, we have , and and for .
The construction of sparsified complexes will be in reverse order: First, we construct a sparsified complex from a given -contraction tree. Then we discuss how to prescribe the metric precision function in order to obtain desired relative and absolute errors, and how to get this -contraction tree from an -contraction tree. Finally, we construct the -contraction tree from a combinatorial tree, and this combinatorial tree from the metric space.
3.1 The sparsified complex
Consider the parameter fixed. Our goal is to use an -contraction tree on a finite metric space in order to obtain a sparse approximation of the Rips complex , which is interleaved modulo simple homotopy equivalence.
The general idea is to construct a sparsified complex by definition of a length function , such that most edges will be removed from the complex by having length , while the others keep their -value. This construction will ensure that , and corresponds to the subsampled complex in Section 2.2. Next, we need an analogue of the multiplexed complex ; this will be , which fulfills , and otherwise attempts to minimize the length function . The crucial estimate to obtain the interleaving
is , which will be shown in Lemma 6.
Definition 4 (Sparsified and Implied Complexes).
The sparsified complex is characterized by the subset of missing edges where ; for all other edges, we have . The implied complex is defined by the implied length ; it has for all non-missing edges. The two complexes are constructed recursively (by a depth-first dual tree traversal, see Section 3.4): In order to assign with , we need to know . We distinguish the following four cases:
If , then , and .
If , and , then and .
If , and , then and .
If none of the above apply, i.e. if and , then .
For the sake of this definition, we consider to be always non-missing.
By construction, and . Sparsity of comes from for all non-missing edges. By the condition for all , it follows that edges are always non-missing, and we can recursively enumerate all non-missing edges by starting from by traversing a tree.
The sparsified and implied complexes have the following properties:
Suppose . Then the inclusion is a single step deformation with .
Suppose . Then the inclusion is a single step deformation with .
The complexes coincide.
Assume without loss of generality that .
The two complexes differ only by the single point . We have already seen that ; hence, the induced is contiguous to the identity. In order to see that it is simplicial, let be such that . Then by construction (case (d)), .
As in (1) the two complexes differ only by the single point . Moreover, from the definition of and , we see that they differ only in the missing edges. Thus, contiguity to the identity follows as before, and for simpliciality we only have to consider the case . This rules out (d) and (a). If the case applies, then . If the case applies, then .
As , we have trivially . Thus, we have to show that implies for all , i.e. for all with . This is equivalent to showing that the edge is non-missing if . Now, if for , then either (a), (b), or (c) apply. If either (b) or (c) applies, it follows directly that . If (a) applies, then as well by induction and monotonicty of the . Thus, the claim follows.
The critical estimate to complete the interleaving is the following:
Suppose that we have a -contraction tree, i.e., for all and .
Then, for all , we have . Therefore , where .
Let without loss of generality with . We will consider the four cases in the definition of in reverse order:
As , the claim is trivial.
In this case we have .
In this case we have .
This case is handled recursively: For with , find such that , and , , and . Then and
either , and
or (as is non-missing), and
3.2 Relative error interleavings
Assume that the dataset is organized into a contraction tree with contraction times and for all . Lemma 6 permits us, in principle, almost arbitrary choices of the interleaving- and the metric approximation , as functions. A very natural choice are interleavings with prescribed relative errors, i.e. of the form . In order to obtain such an interleaving, we construct a -contraction tree by setting contraction times , where . In view of Lemma 6, this yields the estimate , and hence , which is the desired estimate.
As before, the interleaving error actually cuts off at the radius: For , the complex is contractible. This is also the case for : We know that for all , by considering cases . Hence, we actually have the better interleaving estimate .
If data is plentiful, it is often necessary to further restrict the number of samples: That is, we truncate at some , in addition to the multiplicative error . This means, we set for