Context and Motivation.
Topological data analysis attempts to extract relevant information out of a data set by interpreting data as a shape and understanding the connectivity of this shape. Connectivity clearly includes the study of connected components, but also the presence of tunnels, voids, and other high-dimensional topological features.
How do we interpret data as a shape? We consider the common scenario that our data is a set of points in a possibly high-dimensional Euclidean space . Fixing a positive real scale parameter , we define (or simply ) as the union of balls of radius centered at each point in . The shape can be seen as an approximation of the shape formed by for a parameter . As seen in Figure 1, for different values of , the union may form different topological configurations raising the question of the “right” -value that accurately captures the shape of .
One answer, formalized by the theory of persistent homology, is to generate a summary of every topological feature formed at any scale. Note that for . The collection together with inclusion maps whenever forms a filtration. The topological properties of this filtration can be summarized by a finite multi-set of points in the plane, called the persistence diagram. For every topological feature, a point in the diagram denotes the range of scales for which the feature is present. This range is called the persistence of the feature.
Importantly, there are several notions of a distance between two diagrams. In this paper, we use the multiplicative bottleneck distance to compare persistence diagrams. Intuitively, diagrams with small distance have similar features that have long persistence; note that this does not imply that their homology is equal on any scale (see Section 2 for details).
Another important aspect of the above theory is the possibility of efficiently computing persistence diagrams. The Čech complex is the intersection/nerve complex of the balls of radius and has the same homological properties as the union of balls. The derived Čech filtration is a combinatorial object yielding the same persistence diagram as the union of balls. The problem with this standard approach is the sheer size of the produced filtration. It is common practice to only look at the -skeleton of the Čech filtration, that is, ignore intersection of or more balls, where is another constant parameter. That filtration consists of simplices, which is too large for large data sets, even for relatively small values of . In this paper, we will present a compact and efficiently computable approximation of the Čech filtration.
Problem statement and prior work.
We study the question: can we efficiently compute a discrete filtration , called the approximate filtration, that is significantly smaller than the Čech filtration with the persistence diagrams of Čech and approximate filtration being close to each other? Note that the question defines three quality criteria: the size, computation time and the quality of the approximation.
Sheehy [sheehy-rips] showed that an -approximate filtration of size can be computed efficiently for the -skeleton. For , the size of this filtration is . His result, formulated for the closely related Vietoris-Rips complex, also extends to the Čech complex. Note that the size is linear in the number of points, which is a significant improvement over especially when is large and , , and
are of moderate size. The technical idea of the approximation scheme is to compute a net-tree based hierarchical clustering and using it to combine balls whose centers are in the same cluster into a single ball. As a result of this, a large number of ball intersections can be avoided. Similar ideas with the same guarantees, have appeared in[bs-approximating, dfw-gic, ks-wssd]. The dimension in the complexity bounds can also be replaced with the doubling dimension of the metric space, which makes the result especially useful for spaces with low doubling dimension.
Choudhary et al. [ckr-polynomial-dcg] presented an alternative approximation scheme, where the approximation quality is only , but the size of the approximation is significantly smaller at . This result is obtained by tiling the space into permutahedra (a generalization of the hexagonal grid in ) where the diameter of the permutahedra is controlled by the scale parameter . Every permutahedron containing at least one input point is selected and the nerve of every selected permutahedron is reported as the approximation. The improvement in size stems from the fact that at most -permutahedra can intersect in dimensions which upper bounds the number of simplices in the nerve. The approximation quality has been improved to in subsequent work, with size [ckr-barycentric]. In [ckr-polynomial-dcg], it is also shown that, for any with , any -approximate filtration must have a size of at least .
We derive an approximation scheme of size
for an -approximation of the Čech complex for . For constant dimension , this simplifies to , improving all previous results for the Euclidean case. We achieve this approximation based on the techniques devised in [ckr-barycentric, ckr-polynomial-dcg]. On a fixed scale , we tile the space with a cubical grid, carefully select a subset of them and take their nerve as our approximation complex. The novelty lies in the resolution of the tiling. Unlike in [ckr-polynomial-dcg] where the diameter of a permutahedron was in the order of , we use a much smaller diameter of roughly , and we select a hypercube in the approximation if its center is -close to an input point. This is equivalent to approximating the union of -balls with cubical “pixels” at resolution (see Figure 4). The number of selected pixels can be bounded by which is significantly larger than the number of points. Perhaps counter-intuitively, its nerve is still smaller than in previous approaches, because each vertex of the nerve is only incident to simplices in the -skeleton. This technique resembles previous work in computational geometry where adding points (referred to Steiner points) helps to reduce the size of triangulations [beg-provably]. Hudson et al. [hmos-meshing] used similar techniques to compute an -approximation of the Čech complex of size .
Our results pave the way for important extensions whose proofs we omit for the sake of brevity of the presentation. We only announce them to underline the importance of our techniques and postpone the technical discussion to an extended version of this manuscript:
Our approximation from above is not a filtration, but a simplicial tower, that is a sequence of simplicial complexes connected by simplicial maps. Our technique allows, however, for a definition of an actual approximate filtration of the same size. This filtration has the additional property of being a flag complex at each scale, which has computational benefits as its persistence diagram can be computed using faster algorithms.111e.g., Ripser by U.Bauer: https://github.com/Ripser/ripser
The factor in our size bound can be further reduced to by replacing the cubical grid by a permutahedral grid [ckr-polynomial-dcg]. A consequence of this result is that when and , for some , our approximation with the permutahedral grid has a size of matching the lower bound of [ckr-polynomial-dcg].
The result of this work are proved in two steps. We first devise a simple approximation scheme which approximates the union of balls by pixels of a carefully chosen size on each scale. It follows with standard techniques that the approximation quality is . However, the total size of the approximation has an additional factor that depends on the spread of the point set. To achieve spread-independence, we use the observation that when we increase the scale from to , the growth in radius of a ball does only affect the topology of the union of balls if the enlarged balls gives rise to a new intersection pattern with other balls. In other words, we can delay the growth of a ball (and hence, avoiding resampling the balls by pixels) if is not in a critical range of scales. Moreover, we can bound the critical range of all balls efficiently through a well-separated pair decomposition (WSPD) of the point set. This suffices to reduce the total number of pixels used in the approximation to . One technical difficulty arising from this approach is that on a fixed scale, the approximation complex consists of cubes of different sizes.
Outline of the paper
We give a short introduction to topological concepts that are essential for our results. For more details, we refer to standard references such as [bss-metrics, cdgo-sspm, eh-book, hatcher, munkres].
Simplicial complexes and simplicial maps
A simplicial complex on a finite set of elements is a collection of subsets called simplices such that each subset is also in . In other words, simplicial complexes are closed under taking subsets. We say that a simplex has dimension , in which case is called a -simplex. A simplex is a face of if . The -skeleton of consists of all simplices of of dimension or lower. For instance, the -skeleton of is a graph defined by its -simplices and -simplices.
For a point set and a real number , the Čech complex on at scale is the collection of all simplices such that there is a non-empty intersection between the Euclidean balls of radius centered at the points of . Equivalently, if the minimum enclosing ball of has radius at most . We denote the union of -balls as . See Figure 2 for an example.
A simplicial complex is called a flag complex, if it has the following property: whenever a set has the property that every -simplex is in the complex, then the -simplex is also in the complex.
A simplicial complex is a subcomplex of if . For instance, is a subcomplex of for . Let be another simplicial complex. Let be any map that assigns to each vertex of a vertex of . A simplicial map is a map defined using some vertex map , such that for every simplex in , the set of vertices is a simplex of . For , the inclusion map is an elementary example of a simplicial map. Note that any simplicial map is completely specified by its action on the vertices of the domain.
Homotopy and Nerve theorem
Let be two continuous maps between topological spaces . A continuous function is said to be a homotopy between and , if and . In this case, and are said to be homotopic to each other, and we record this relation as . Informally, the second parameter of can be interpreted as time, so that is a continuous deformation of into , as time varies from 0 to 1.
and are said to be homotopy equivalent if there exist continuous maps and such that and , where and are the identity functions on and , respectively. We record this relation as . Intuitively, two spaces are homotopy equivalent if they can be continuously transformed into one another.
Let denote a finite collection of sets. The nerve of is an abstract simplicial complex consisting of simplices corresponding to non-empty intersections of elements of , that is,
For instance, the Čech complex at scale is .
Theorem 2.1 (nerve theorem).
Let us denote by a finite collection of closed sets in Euclidean space, such that all non-empty intersections of the form are contractible. Then, , that is, they are homotopy equivalent (see [bjorner-detailed, bjorner-small, borsuk-nerve, walker-hom] for more general versions of the theorem).
In particular, if is a collection of convex sets, then all non-empty common intersections are contractible, so that the nerve theorem applies to . As an example, .
Filtrations and simplicial towers
Let be a set of real values which we refer to as scales. A filtration is a collection of simplicial complexes indexed by , such that for all . For instance, is the Čech filtration. A (simplicial) tower is a sequence of simplicial complexes with being a discrete set (for instance ), together with simplicial maps between complexes at consecutive scales.
A tower of the form on simplicial maps with index set can always extend the tower to , by setting and for all . We call this technique of extending towers the standard filling technique in our paper for simplicity. The approximation constructed in this paper will be an example of such a tower.
We say that a simplex is included in the tower at scale if is not in the image of the map , where is the scale preceding in the tower. The size of a tower is the number of simplices included over all scales. If a tower arises from a filtration, its size is simply the size of the largest complex in the filtration (or infinite, if no such complex exists). However, this is not true in general for simplicial towers, since simplices can collapse in the tower and the size of the complex at a given scale may not take into account the collapsed simplices which were included at earlier scales in the tower.
Persistence diagram and Interleavings
A collection of vector spacesconnected with homomorphisms (linear maps) is called a persistence module, if is identity on for all , and for all for the index set .
We can generate persistence modules using simplicial complexes. Given a simplicial tower , we can fix some base field to obtain a sequence of vector spaces with linear maps , that forms a persistence module. This is true because of the functorial properties of homology [munkres]. The same construction is also applicable to filtrations.
Under certain tameness conditions, persistence modules admit a decomposition into a collection of intervals of the form (with ). These intervals can be represented as a set of points in the plane, called the persistence diagram, where each interval is simply represented as the point . See Figure 3 for an example. The persistence diagram of a persistence module characterizes the module uniquely up to isomorphism. If the persistence module is generated by a simplicial complex, each point in the diagram corresponds to a homological feature (a “hole”) that comes into existence at complex and persists until it disappears at .
Two persistence modules and with respective linear maps and are said to be (multiplicatively) strongly -interleaved if there exist a pair of families of linear maps and for , such that Diagram (1) commutes for all (the subscripts of the maps are excluded for readability). In such a case, the persistence diagrams of the two modules are said to be -approximations of each other in the sense of [ccggo-proximity]. More precisely, there is a partial matching between the points of the two diagrams, after a suitable logarithmic scaling of the diagrams. Further details can be found in [buchet-thesis]. See Figure 3 for an example.
Next, we discuss a special case that relates to the equivalence of persistence modules [cz-computing, handbook]. Two persistence modules and on linear maps respectively are isomorphic, if there exists an isomorphism for each such that the diagram
commutes for all . Isomorphic persistence modules have identical persistence barcodes.
3 A simple digitization scheme
In this section we describe our first approximation scheme which is conceptually simple and lays down the foundation for the more technical approximation scheme that we present in Section 4. We describe our scheme for a set of points in -dimensional Euclidean space. We assume throughout that .
A cubical subdivision
Let . For any , we define the lattice as the lattice whose basis vectors have been scaled by . We define the pixels of this lattice as the smallest -dimensional cubes whose vertices are the lattice points. The diameter of the pixels of is at most . Our approximation complexes are built as nerves of the union of pixels of .222It seems simpler at first to define the approximation as a cubical complex directly instead of taking the nerve, but it is more complicated to construct the chain maps connecting different scales in this cubical setup.
Each pixel of lattice is fully contained in some pixel of when with both in . This also implies that each pixel center of has a unique nearest-neighbor among the pixel centers in .
3.1 Approximation complex
Let denote the -dimensional Euclidean ball of radius , centered at any input point . We denote by
the set of -balls centered at the input points. Naturally, the Čech complex on at scale is the nerve of , that is, .
Let denote the set of scales
We now define our approximation complex for . Consider the maximal such that . As discussed, there exists a cubical subdivision based on the lattice . Let denote the set of pixels of that subdivision whose center lies in , that is, which are in distance at most to a point in . We define . Also, we write for the union of all pixels in . See Figure 4 for an illustration of and .
Lemma 3.1 (Sandwich lemma).
For any ,
For , let be as in the definition of . For the first inclusion, if , lies in some pixel of the grid . Let denote the center of this pixel. There is some with . The diameter of is , hence , and by the triangle inequality, .
For the second inclusion, fix and let such that . Note that as well, and the approximation complex is either constructed using the same as for , or using if . In both cases, lies in a pixel with diameter at most , so the distance of to the center of is at most . By the triangle inequality, , so the pixel belongs to . ∎
The lemma shows that . Using the standard filling technique from Section 2, we extend this discrete space to a continuous filtration:
Moreover, the Sandwich lemma together with the strong-interleaving diagram (Diagram (1)) implies at once:
-approximates the persistence module .
Note that we obtain instead of because we consider the continuous filtration instead of the discrete filtration of .
The theorem also implies that is a -approximation of the Čech filtration since the Čech filtration is dual to and has the same persistence diagram.
3.2 Connecting the scales
We now turn our attention to the approximation complex . While at each scale , is the nerve of and forms a filtration, it is not true that for all since . Therefore, it is not sufficient to apply the persistent nerve lemma of [co-pnl] directly, but it requires a more involved analysis, which we describe next. We show that the complexes are connected by simplicial maps.
A first useful property is that is a flag complex. This follows from the following statement, which we prove in more general form for later use. We define an axis-aligned cuboid in as the Cartesian product of intervals (where the degenerate case is allowed that consists of only one point). For instance, all pixels of any lattice are (non-degenerate) cuboids.
The nerve complex of any finite collection of cuboids is a flag complex.
We show that if there is a set of cuboids that pairwise intersect, then all of them have a common intersection. The intersection of the set of cuboids
is the cuboid
and it suffices to show that these intersection are non-empty coordinate-wise. By assumption, is non-empty for all . Helly’s theorem [helly] for the case of intervals implies that all intersect commonly. ∎
Next, we consider and define a simplicial map . This is simple if and are constructed using the same grid, as in that case and consequently, . If the grid changes, there is no direct inclusion. There is, however, a natural map mapping each pixel in to the unique pixel in that contains . That fact that is indeed in follows from Sandwich Lemma.
The map extends to a simplicial map .
From Lemma 3.3 we know that is a flag complex. Therefore it suffices to show that maps every edge of to either a single vertex or an edge of . But that follows at once, because an edge in corresponds to two pixels on the grid of which are intersecting, and the map as defined above either maps both of them to the same pixel, or to two pixels which are also intersecting. ∎
By composing the maps above, we obtain maps for all . Using the standard filling technique, we define the (continuous) simplicial tower
We next establish a relationship between the approximation tower and . More precisely, we will show that both towers yield the same persistence diagram. Using Theorem 3.2, this implies that is a -approximation of the Čech complex.
Since consists of convex objects, the nerve theorem (Theorem 2.1) asserts that , for each scale , so they have isomorphic homology groups. All that is left to show is that the homology map induced by the simplicial maps from above commutes with the isomorphisms from the Nerve theorem. We prove this indirectly by introducing an intermediate filtration that is equivalent to both filtrations. This intermediate filtration considers the union of all pixels with . We shift the technical details to Appendix LABEL:subsection:appendix-section3-interleaving-alternate and just state the final result.
The persistence modules and are isomorphic.
We conclude with the main result of this section. Setting in our approximation, we see that . We conclude that
and are -approximations of each other, for .
3.3 Size and Computation
For every , the -skeleton of the approximation tower has size
Let be such that the complex is built using the lattice . Note that . The sidelength of a pixel of this lattice is . A ball of half this radius is contained inside the pixel. Since the pixels are interior-disjoint, a simple packing argument shows that each ball in is covered by no more than
pixels. There are balls of , hence, contains at most pixels. Each pixel is incident to at most other pixels. Each simplex incident to a pixel has vertices among these pixels. Therefore, the size of the -skeleton incident to each pixel is then . In total, the -skeleton of has size
independent of . ∎
To compute the complex at a given scale , we first find the pixels at that scale using a simple flooding algorithm that starts at vertices of . Then we inspect the neighborhood of each pixel to compute the -skeleton incident to that vertex. See Appendix LABEL:subsection:appendix-section3-algo for more details.
At each scale of , the -skeleton of the approximation tower and the simplicial map can be computed in time
4 Removing the spread
The size and computation bounds from Section 3.3 only hold for a single considered scale. To get a bound on the total size and complexity, one needs to multiply with the considered number of scales. If the entire filtration is of interest, this number can be upper bounded by the logarithm of the spread of the point set (the ratio of the diameter and the closest distance). In this section, we remove that dependence on the spread.
4.1 A short overview of the approximation scheme
We first informally illustrate the idea behind eliminating the dependence on spread. The main exposition starts from Subsection 4.2, where we start with the details of our construction.
The crucial observation guiding our improved approximation scheme is that while the union of balls changes continuously for all , the Čech complex changes only at a finite number of scales. For instance, if an edge enters the Čech filtration at , then the intersection of the -balls at and captures the edge. But at higher scales, it is pointless to allow these balls to grow further to represent this edge. So we allow the balls to grow individually only in a lazy fashion, only when they form new connections. This ensures that the set of edges of the original Čech filtration are correctly captured. Allowing for some slack in growing the balls, that is, if we allow the balls at to grow in the interval , we can capture all simplices incident to and . Naturally, this lazy union of balls is homotopy-equivalent to the original union. See Figure 5 for an example.
The lazy union is non-homogeneous, so it is non-trivial to pixelize it. The ideal approach is to pixelize each ball according to its radius, to get a non-homogeneous union of pixels. As we show later, this does not pose a problem if the pixels are chosen as in Section 3. To construct the pixelization efficiently, we make use of a well-separated pair decomposition. Each pair of points has roughly the same distance as the distance between any points of a well-separated pair that covers . Hence, the lazy balls at the representatives approximate the lazy balls at the points of the pair, and we use the former for pixelization. The use of an -WSPD ensures that there are at most lazy balls. See Figure 5 for an example (further details in Subsection 4.2).
We collect the pixels that intersect the union of balls. While the lazy union and its pixelization do not interleave on a space level as in Lemma 3.1, we show that they still form persistence modules that are closely interleaved (Subsection 4.3). The nerves of these pixelizations are our approximation complexes, and they can be connected to form a simplicial tower: for this, we simply use the map that takes a pixel at a lower scale to the pixel at the higher scale that contains it.
To compute the pixelization, we first construct a WSPD and identify the intervals at which the representatives are growing. At each such scale, we identify only those balls that are expanding, pixelize them, and take the nerve. This constructs our simplex inclusions at that scale. More details follow in Subsection LABEL:subsection:sizecomputation_new.
4.2 Lazy union of balls
We review the concept of a well-separated pair decomposition (WSPD) from [ck-wspd], which plays a crucial role in our algorithm. A -WSPD of consists of pairs of the form that satisfy
each is a well-separated pair (WSP), that means,
and for each pair of points , there exists a WSP in the WSPD such that either or . That means, a WSPD covers each pair of points of .
Let denote a -WSPD on , where . For each WSP let denote the points of in and , respectively. We pick a point and call it the representative of (similarly for ). We denote the distance between the representatives as . From the WSPD-property, . Using the triangle inequality, we see that the distance between any two points of and lies in the interval since . For reasons that becomes apparent later, we scale the interval by a factor of and call
the active interval of the pair . For every point , we define
as the active interval of . Each scale is an active scale. Next, we define for ,
We can interpret these definitions as follows: specifies a range of scales for which the -ball of might encounter new intersections. The function is monotonously increasing, for all , and if is an active scale; otherwise, the radius of the ball just remains at the last encountered active scale. is the union of balls with radii given by the functions.
Note that whenever . Hence, is a filtration.
The persistence module is a -approximation of (and consequently, also of the Čech filtration).
The (somewhat tedious) proof can be summarized as follows: we define an intermediate filtration of balls where not only the balls of representatives, but of all balls participating in a WSP grow. In the first part, we show that this filtration can be sandwiched with that of . In the second part, we show that the intermediate filtration has the same nerve as the union of -balls.
We now define the intermediate filtration. Let
be a scaled version of the active interval from above. Define
Intuitively, all the balls centered at points in a WSP grow when the WSP is active (but those of the representatives grow for a slightly longer time).
For all , .
The first inclusion follows at once from the fact that for all points of and all scales. For the second inclusion, let be a point in and let be such that . There is some WSP such that lies in and is the representative of or of , or lies in some a WSP, and or . In the first case, it follows that and hence . For the rest of the proof, we assume without loss of generality that lies in some and .
Let be the representative of . Then is a subset of . By our choice of and , we know that contains the value because for . On the other hand, , where the last inequality comes from the fact . By triangle inequality, , which implies the claim. ∎
The lemma implies that the two filtrations -approximate each other. Next, we consider
which is the collection of balls whose union forms . By the nerve theorem, the nerve of is homotopically equivalent to , and since the balls have non-decreasing radius, the persistence modules and are persistence-equivalent. The latter, however, is exactly the Čech persistence module, as we show next.
for all scales .
Because , follows at once. For the other direction, fix any simplex of . Let denote the smallest radius such that the -balls around the intersect. It suffices to show that for all . For fixed, let denote the point of with maximal distance to . Clearly, . On the other hand, there exists a WSP that covers giving rise to the interval . With the well-separation property and the triangle inequality,
which leads to the inequality
since . Moreover, , and a similar calculation shows that
which implies that . Since or , this implies immediately that for . ∎
We next define a collection of pixels for each . In contrast to Section 3, these pixels will be taken from different grids. For technical reasons, we enlarge the critical intervals by the smallest amount such that their endpoints are points in . The definitions of , and get adapted accordingly, without affecting the claims of Lemma 4.2 and Lemma 4.3.
We assume again without loss of generality that we start the construction at a minimal scale that is sufficiently small and set . We define for inductively: let denote the largest power of that is not larger than as in Section 3. Call a point active if , and inactive otherwise. Let be the collection of all pixels whose centers lie in the any ball of radius at an active points . Define as the inclusion-maximal pixels of the set . We denote by the union of the pixels in . See Figure 6 for an illustration. We use the standard filling technique to define for every .
To rephrase our construction: at scale , we resample the pixels of all balls which are growing from to in the filtration , and add them to the set of pixels, removing lower-scale pixels which are covered by the new ones.
The idea of the digitization is to approximate with pixels. Indeed, we have that . That follows inductively, observing that the balls that grow from to are covered by pixels. It is also true that , where is obtained from by enlarging every ball by a factor of . The proofs are similar to the proof of Lemma 3.1. However, it is not true that in general (more precisely, neither nor holds in general). Therefore, the simple sandwiching strategy from Section 3 fails. We need a more refined argument, showing that the interleaving still works on the homology level.
For that, the following observation is crucial:
The spaces and are homotopically equivalent, and the maps realizing this homotopic equivalence can be chosen as deformation retracts to the intersection .
The statement follows from the following statement on unions of balls, noting that the balls forming , , and all have the same nerve by construction.
Let , be collections of closed balls such that and and have the same center for all . If , there is strong deformation retraction from to .
We only briefly sketch the proof idea and postpone details to an extended version. Note that because the balls are closed, there exists an , such that we can increase all balls of without changing the nerve. This yields a collection of balls with the same centers, such that . We can define a collection of distance-like functions , such that is the sublevel set of for value , and is the sublevel set of for value . Let denote the lower envelope of these functions. This function is continuous, but not differentiable; however, using techniques similar to the generalized gradient [ccl-sampling], it is possible to define a flow on . The singularities of this flow are outside of because a singular points of the generalized gradient triggers a change in the nerve. That flow hence defines a deformation retract from to .
We can use the above to define a deformation retract from to : Simply restricting the retract to is not sufficient because a point of may be pushed out of during the flow. However, note that was defined using some parameter . Call a point -affected if there exists an such that , but . For every , we can find some such that an open neighborhood of is not -affected. If is not -affected, its flow does not change when further decreasing . Hence, for every , we can define its flow to be the limiting flow for . Because of the neighborhood property above, the resulting flow is continuous on , and defines the desired deformation retraction.
Lemma 4.6 (Lazy sandwich lemma).
The two persistence modules and are -approximations of each other.
We need to define interleaving maps such that the four diagrams of (1) commute. We restrict our attention to scales in and show a -interleaving for these scales, which implies the result.
From , we take the homology map induced by the inclusion . With that, the diagram