Approximation algorithms for 1-Wasserstein distance between persistence diagrams

04/15/2021 ∙ by Samantha Chen, et al. ∙ University of California, San Diego

Recent years have witnessed tremendous growth in the use of topological summaries, especially persistence diagrams (which encode so-called persistent homology), for analyzing complex shapes. Intuitively, persistent homology maps a potentially complex input object (be it a graph, an image, a point set, and so on) to a unified type of feature summary, called the persistence diagram. One can then carry out downstream data analysis tasks using such persistence diagram representations. A key problem is to compute the distance between two persistence diagrams efficiently. In particular, a persistence diagram is essentially a multiset of points in the plane, and one popular distance is the so-called 1-Wasserstein distance between persistence diagrams. In this paper, we present two algorithms to approximate the 1-Wasserstein distance for persistence diagrams in near-linear time. These algorithms primarily follow the same ideas as two existing algorithms that approximate optimal transport between two finite point sets in Euclidean space via randomly shifted quadtrees. We show how these algorithms can be effectively adapted to the case of persistence diagrams. Our algorithms are much more efficient than previous exact and approximate algorithms, both in theory and in practice, and we demonstrate their efficiency via extensive experiments. They are conceptually simple and easy to implement, and the code is publicly available on GitHub.


1 Introduction

Recent years have witnessed tremendous growth in the use of topological summaries, especially persistence diagrams (encoding so-called persistent homology), for analyzing complex shapes. Indeed, persistent homology is one of the most important developments in the field of topological data analysis in the past two decades [EdelsLetsZomo2002, EdelsbrunnerHarer2010].

Given an object, e.g., a mesh, an image, a point cloud, or a graph, and a specific view of how the object evolves (more formally, a filtration of it), persistent homology maps the input, a potentially complex object, to a topological summary, called the persistence diagram, which captures multiscale features of this object w.r.t. this view. Persistent homology thus provides a unifying way of mapping complex objects to a common feature space: the space of persistence diagrams. One can then carry out data analysis tasks on the original objects, e.g., clustering or classifying a collection of graphs, in this feature space. Indeed, in the past decade, persistence diagram summaries have been used for a range of applications in various domains, e.g., in material science [Buchet2018, Hiraoka7035, LB17], neuroanatomy [HessNeuron, LWAPW17], graphics [CCGMO09, SOC10], medicine/biology [GambleHo2010, lfm-pro2007], etc.

A key component of such a persistent-homology based data analysis framework is to put a suitable metric on the space of persistence diagrams, and to compute such distances efficiently. One classic distance measure developed for persistence diagrams is the $p$-th Wasserstein distance, and in practice a popular choice for $p$ is $p = 1$, i.e., the 1-Wasserstein distance. This paper focuses on developing efficient, practical and lightweight algorithms to approximate the 1-Wasserstein distance for persistence diagrams.

In particular, a persistence diagram consists of a multiset of points in the plane, where each point corresponds to the creation and death of some topological feature w.r.t. some specific filtration (view) of the input object. Given two persistence diagrams $A$ and $B$, the 1-Wasserstein distance between them, denoted by $W_1(A, B)$, is similar to the standard 1-Wasserstein distance (also known as the earth-mover distance) between these two multisets of planar points, but with an important distinction: points are also allowed to be matched to points in the diagonal $\Delta$ (the line defined by the equation $y = x$) in the plane. Intuitively, the topological features associated to points matched to the diagonal are considered noise.

The 1-Wasserstein distance for persistence diagrams can be computed using the Hungarian algorithm [Kuhn] in $O(n^3)$ time, where $n$ is the total number of points in the two persistence diagrams. This algorithm is implemented in the widely used Dionysus package [Dionysus]. In [Hera], Kerber et al. develop a more efficient algorithm to approximate the 1-Wasserstein distance between finite persistence diagrams within constant factors. Their algorithm is based on the auction algorithm of Bertsekas [BertsekasAuction], but with a geometric twist: given that points in the persistence diagrams all lie in the plane, they use a weighted kd-tree to provide more efficient search inside the auction algorithm. In [BertsekasAuction], the time complexity of the auction algorithm is stated in terms of the number of possible pairings of points between persistence diagrams ($O(n^2)$ in the worst case) and the maximum cost over all possible pairings between persistence diagrams. While Kerber et al. did not provide an asymptotic time complexity for their approximation algorithm, they provided an empirical estimate of its running-time growth (not a true asymptotic time complexity) by using linear regression on the observed running time versus the size of the problems. They further show via extensive experiments that their approximation algorithm has a speed-up factor of 50 for small instances to a speed-up factor of 400 for larger instances in comparison to the Hungarian-algorithm based implementation.

Related work in optimal transport for Euclidean point sets.

As we will formally introduce in Section 2, the 1-Wasserstein distance for persistence diagrams can be viewed as the standard 1-Wasserstein distance for discrete planar point sets with special inclusion of points in the diagonal. In what follows, to avoid confusion, we refer to the standard 1-Wasserstein distance between point sets as the optimal transport (OT) distance. Starting from [Bartal96] and [Bartal98], there has been a long line of work on approximating the OT-distance for Euclidean point sets using randomly shifted quadtrees (e.g., [Charikar2002, KleinbergTardos2002, IndykThaper2003, LYFC19, SatoYamada2020, Flowtree]). In particular, we consider two such approaches: the $\ell_1$-embedding approach of [IndykThaper2003], and the flowtree approach of [Flowtree]. The former maps an input point set to a certain count-vector with the help of a randomly shifted quadtree, and uses the $\ell_1$ distance between two such count-vectors to approximate the OT-distance between the two input point sets. The latter also uses a randomly shifted quadtree and embeds input points into quadtree cells. It then shows that a certain cost computed from an optimal OT-flow induced by the tree metric (which can be computed by a greedy algorithm in linear time) can approximate the OT-distance between the original point sets. Let $\Phi$ denote the spread of the union of the two input point sets. Both approaches give an $O(\log \Phi)$-approximation of the OT-distance between the original point sets, in $O(n \log \Phi)$ time.

Recently, the idea of using metric trees to approximate the OT-distance has also been extended to the more general unbalanced optimal transport problem (where the two measures may have different total mass) in [SatoYamada2020]. In [SatoYamada2020], Sato et al. develop an efficient algorithm to approximate unbalanced optimal transport on tree metrics using dynamic programming.

New work. 

In practice, for applications such as nearest neighbor search, clustering, and classification on large data sets, huge numbers of distance computations are needed. The time complexity of the aforementioned algorithms for persistence diagrams, using the Hungarian algorithm or the geometric variant of the auction algorithm, still causes a significant computational burden. In this paper, we aim to develop near-linear time approximation algorithms for the 1-Wasserstein distance between persistence diagrams. Specifically:

  • In Section 3, we show how to modify the algorithms of [IndykThaper2003] and [Flowtree] to approximate the 1-Wasserstein distance between persistence diagrams within the same approximation factor (Theorems 3.1.1 and 3.2). Note that in the literature (e.g., [Hera]), it is known that $W_1(A, B)$ between two persistence diagrams can be computed by (i) first augmenting $A$ and $B$ to $A' = A \cup \pi(B)$ and $B' = B \cup \pi(A)$, respectively, where $\pi$ projects a point to its nearest neighbor in the diagonal $\Delta$; and then (ii) computing the OT-distance between $A'$ and $B'$, although it is important to note that the cost of matching two diagonal points needs to be set to $0$, instead of the standard Euclidean distance. However, this requires modifying the cost for diagonal points; in addition, it also requires modifying a diagram depending on which other diagram it is to be compared with. We instead develop a modification where such a projection is not needed.

  • Our modified approaches maintain the simplicity of the original approximation algorithms and are easy to implement. In comparison to the approximation for unbalanced optimal transport presented in [SatoYamada2020], our modified approaches are specific to persistence diagrams, and the data structures needed for both of our approaches are much simpler than those of [SatoYamada2020]. Our code is publicly available on GitHub. In Section 4, we present various experimental results for our new algorithms. We show that both are orders of magnitude faster than previous approaches, although at the price of worsened approximation error. However, note that in practice the approximation factors are rather small, not as large as the worst-case approximation factor. We also note that the modified flowtree algorithm empirically achieves a more accurate approximation of the 1-Wasserstein distance for persistence diagrams than the modified $\ell_1$-embedding approach, although the latter is significantly faster than the former. However, the $\ell_1$-embedding approach is easier to combine with proximity-search data structures, e.g., locality sensitive hashing (LSH), given that each input persistence diagram is mapped to a vector and the distance computation is the $\ell_1$-distance between two such vectors.

2 Preliminaries

In this section, we first introduce persistence diagrams and the 1-Wasserstein distance between them, which is related to the optimal transport distance (the standard 1-Wasserstein distance) for Euclidean point sets. We next describe two existing approximation algorithms for the optimal transport distance [Flowtree, IndykThaper2003] based on the use of randomly shifted quadtrees. Our new algorithms (in Section 3) will be based on these two approximation algorithms.

2.1 Persistence Diagrams and 1-Wasserstein distance

We first give a brief introduction to persistent homology and its associated persistence diagram summary. See [EdelsbrunnerHarer2010] for a more detailed treatment of these topics. Suppose we are given a topological space $X$. A filtration of $X$ is a growing sequence of subspaces $X_1 \subseteq X_2 \subseteq \cdots \subseteq X_m = X$, which can be viewed as a specific way to inspect $X$. For example, a popular way to generate a filtration of $X$ is to take some meaningful descriptor function $f: X \to \mathbb{R}$ on $X$, and take the growing sequence of sublevel sets $X_t = f^{-1}((-\infty, t])$, as $t$ increases, to be the filtration. Now given a filtration of $X$, through its course, new topological features (e.g., components, independent loops, and voids, which are captured by the so-called homology classes) will sometimes appear and sometimes disappear. Persistent homology encodes the birth and death of such features in the persistence diagram. In particular, a persistence diagram consists of a multiset of points in the plane, that is, a set of points with multiplicities, where each point $(b, d)$ with multiplicity $m$ intuitively means that $m$ independent topological features (homology classes) are created in $X_b$ and killed in $X_d$. Thus, we also refer to $b$ and $d$ as the birth-time and death-time. The persistence of such a feature is $|d - b|$, which is the lifetime of this feature. We refer to points in the persistence diagram as persistent-points.

Note that, in general, persistent-points lie above the diagonal $\Delta = \{(x, x) \mid x \in \mathbb{R}\}$ in the plane. Points closer to the diagonal have lower lifetime (persistence) and thus are less important, with a point on the diagonal intuitively meaning a feature with persistence $0$.

To compare two persistence diagrams $A$ and $B$, intuitively, we wish to find a one-to-one correspondence between their multisets of points (and thus between the features they capture). However, the two sets may be of different cardinality, and we also wish to allow a persistent-point from one diagram to be “noise" and not present in the other diagram, which can be captured by allowing this point to be matched to its nearest-neighbor projection in $\Delta$. Let $\pi: \mathbb{R}^2 \to \Delta$ be this projection, where $\pi(x, y) = (\frac{x+y}{2}, \frac{x+y}{2})$. The following $p$-th Wasserstein distance essentially captures this intuition [EdelsbrunnerHarer2010]. [$p$-Wasserstein distance for persistence diagrams] Given a persistence diagram $A$, its augmentation $\hat{A}$ consists of $A$ together with all points in $\Delta$, each with infinite multiplicity. Given two persistence diagrams $A$ and $B$, with their augmentations $\hat{A}$ and $\hat{B}$, respectively, the $p$-Wasserstein distance between them is

$$d_{W,p}(A, B) = \Big( \inf_{\phi: \hat{A} \to \hat{B}} \sum_{x \in \hat{A}} \|x - \phi(x)\|^p \Big)^{1/p}, \qquad (1)$$

where $\phi$ ranges over all possible bijections between the two sets. Note that $\|\cdot\|$ is used to denote the inner norm on $\mathbb{R}^2$. If $p = \infty$, the $p$-Wasserstein distance is the classic bottleneck distance between persistence diagrams [PersistenceStability2007, Hera]. In this paper, we are interested in the case $p = 1$. It turns out that an equivalent definition (which we will use in this paper) is as follows: [1-Wasserstein distance for persistence diagrams, version 2] Given two point sets $A$ and $B$ in $\mathbb{R}^2$, an augmented (perfect) matching for them is a subset $M \subseteq (A \cup \Delta) \times (B \cup \Delta)$ such that (i) each $a \in A$ or $b \in B$ appears in exactly one pair in $M$, and (ii) each pair in $M$ is of one of the following three forms: (1) $(a, b) \in A \times B$, (2) $(a, \pi(a))$ with $a \in A$, or (3) $(\pi(b), b)$ with $b \in B$.

Given two persistence diagrams $A$ and $B$, the 1-Wasserstein distance between them is:

$$W_1(A, B) = \min_{M} \sum_{(x, y) \in M} \|x - y\|, \qquad (2)$$

where $M$ ranges over all possible augmented matchings for $A$ and $B$.

2.2 Relation to optimal transport

Readers may have already noticed the similarity between Definition 2.1 and the standard $p$-th Wasserstein distance between two probability measures. To avoid confusion, from now on we refer to the standard 1-Wasserstein distance between measures as optimal transport, so as to differentiate it from the 1-Wasserstein distance between persistence diagrams.

[Optimal transport] Given a finite metric space $(X, d)$ and two measures $\mu, \nu$ on $X$ with $\mu(X) = \nu(X)$, the optimal transport between them is

$$\mathrm{OT}(\mu, \nu) = \min_{\Gamma} \sum_{(x, y) \in X \times X} \Gamma(x, y)\, d(x, y), \qquad (3)$$

where $\Gamma$, called a transport plan or a flow, is a measure on $X \times X$ whose marginals equal $\mu$ and $\nu$, respectively; that is, $\sum_{y \in X} \Gamma(x, y) = \mu(x)$ and $\sum_{x \in X} \Gamma(x, y) = \nu(y)$.

Given a multiset $A$ of points in the plane, note that we can view it as a discrete measure $\mu_A = \sum_{a \in A} m_a\, \delta_a$ supported on the points of $A$, so that $\mu_A(S) = \sum_{a \in A \cap S} m_a$ for each subset $S$ of $\mathbb{R}^2$, where $m_a$ is the multiplicity of $a$ in $A$, while $\delta_a$ is the Dirac measure supported at $a$. Hence in what follows, we sometimes abuse the notation and equate a multiset of points with the discrete measure induced by it, and talk about the optimal transport between two multisets of points.

As shown in [Hera], one can consider $A' = A \cup \pi(B)$ and $B' = B \cup \pi(A)$ and modify the Euclidean distance so that $d(x, y) = 0$ for $x, y \in \Delta$ to obtain a modified pseudo-metric space. In this case, $W_1(A, B)$ becomes the optimal transport between the discrete measures induced by $A'$ and $B'$ under this modified pseudo-metric.

We can also relate $W_1(A, B)$ to the optimal transport between the discrete measures induced by $A'$ and $B'$ under the standard Euclidean distance, with the following observation (a simple proof is in Appendix A): Let $\mu_{A'}$ be the discrete measure induced by $A'$ and $\nu_{B'}$ be the discrete measure induced by $B'$. Then $\mathrm{OT}(\mu_{A'}, \nu_{B'}) \le 2\, W_1(A, B)$. Given two discrete measures on a finite metric space $(X, d)$, computing the optimal transport distance can be reduced to finding the optimal min-cost flow on a complete bipartite graph using combinatorial flow algorithms, as described in [Kuhn]. In our setting later, $\mu$ and $\nu$ will both be induced by point sets in $\mathbb{R}^2$, and $d$ is the standard Euclidean distance.
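To make this reduction concrete, the following is a minimal sketch (our own illustration, not the authors' implementation) of the exact computation of $W_1(A, B)$ via the augmentation $A' = A \cup \pi(B)$, $B' = B \cup \pi(A)$, using SciPy's assignment solver; setting the diagonal-to-diagonal cost to $0$ implements the modified pseudo-metric of [Hera]. The function name exact_w1 is ours.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def exact_w1(A, B):
        """Exact 1-Wasserstein distance between two persistence diagrams.

        A, B: arrays of (birth, death) points; both assumed non-empty.
        Runs in O(n^3) time via the assignment problem (Hungarian method),
        matching the combinatorial approach discussed above.
        """
        A, B = np.asarray(A, float), np.asarray(B, float)
        proj = lambda P: np.column_stack(((P[:, 0] + P[:, 1]) / 2,) * 2)
        A_aug = np.vstack([A, proj(B)])  # A' = A plus projections of B
        B_aug = np.vstack([B, proj(A)])  # B' = B plus projections of A
        # pairwise Euclidean costs between points of A' and B'
        C = np.linalg.norm(A_aug[:, None, :] - B_aug[None, :, :], axis=2)
        # matching two diagonal points costs 0 under the modified metric
        C[len(A):, len(B):] = 0.0
        rows, cols = linear_sum_assignment(C)
        return C[rows, cols].sum()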

2.3 Quadtree-based approximation algorithms for optimal transport

In this section, we briefly review two algorithms that approximate the optimal transport between two discrete measures $\mu$ and $\nu$. Both of these algorithms use a randomly shifted quadtree, which we introduce first.

Randomly-shifted quadtree.

Let $P \subset \mathbb{R}^d$ be a finite set of points (in our setting, $d = 2$ for persistence diagrams). To simplify the description, we will assume that the minimum pairwise distance between any two points in $P$ is 1 and that $P$ is contained in $[0, \Phi]^d$ (where $\Phi$, the ratio of the diameter of $P$ over the minimum pairwise distance, is also called the spread of $P$). First, let $H$ be the hypercube with side length $2\Phi$ which is centered at the origin. Now shift $H$ by a random vector whose coordinates are drawn uniformly from $[0, \Phi]$ to obtain $H'$. Note that $H'$ still encloses $P$, as $H'$ has side length $2\Phi$.

Construct a tree of hypercubes by associating $H'$ to the root and halving it along each dimension. Recurse on the resulting sub-hypercubes that contain at least one point from $P$, and stop when a hypercube contains exactly one point from $P$. Each leaf node of the resulting quadtree $T$ contains exactly one point in $P$, and there are exactly $|P|$ leaves. The resulting quadtree has $O(\log \Phi)$ levels. To see this, consider the depth $h$ of some internal node. The hypercube associated with the node has side length $2\Phi / 2^h$, and the distance between any two points in the hypercube is at most $\sqrt{d} \cdot 2\Phi / 2^h$. Since an internal node contains at least two points and the minimum pairwise distance is 1, we have $\sqrt{d} \cdot 2\Phi / 2^h \ge 1$, so $h \le \log(2\sqrt{d}\,\Phi)$ and there are $O(\log \Phi)$ levels in $T$. Additionally, the size of $T$ is $O(|P| \log \Phi)$, and it can be constructed in $O(|P| \log \Phi)$ time (hiding terms polynomial in the dimension $d$). We label levels so that a quadtree cell at level $i$ has side length $2^i$; the root is thus at level $\log(2\Phi)$, and levels decrease toward the leaves. The weight of each tree edge between a node at level $i$ and its parent at level $i + 1$ is $2^i$.
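The following is a minimal Python sketch of this construction under the stated assumptions (2D points in $[0, \Phi]^2$ with minimum pairwise distance 1); the function and field names are ours, not those of the authors' released code.

    import random
    import numpy as np

    def build_quadtree(points, phi):
        """Build a randomly shifted quadtree over 2D points with spread phi.

        Returns (shift, root), where each node records its level and the
        points it contains; a node becomes a leaf once it holds one point.
        """
        shift = np.array([random.uniform(0, phi) for _ in range(2)])
        top = int(np.ceil(np.log2(2 * phi)))  # root cell has side 2**top

        def make_node(level, pts):
            node = {'level': level, 'points': pts, 'children': []}
            if len(pts) <= 1:
                return node  # leaf: exactly one point remains
            side = 2.0 ** (level - 1)  # children live one level down
            groups = {}
            for p in pts:
                cell = tuple(np.floor((p + shift) / side).astype(int))
                groups.setdefault(cell, []).append(p)
            node['children'] = [make_node(level - 1, g) for g in groups.values()]
            return node

        pts = [np.asarray(p, float) for p in points]
        return shift, make_node(top, pts)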

Approximation algorithm 1: $\ell_1$-embedding via the quadtree.

Given two discrete measures $\mu$ and $\nu$, let $P$ be the union of their supports. (Note that, in general, $P$ can be a superset of the supports of $\mu$ and $\nu$. Indeed, if there is a set of measures and we perform kNN queries for a query measure, it is more convenient to set $P$ as the union of the supports of all these measures and build only a single quadtree $T_P$.) Construct the randomly shifted quadtree $T_P$ as described above; the subscript $P$ is sometimes omitted when its choice is clear from the context. Given a tree node $u$, its level is denoted by $\ell(u)$. We will abuse notation slightly and use $u$ also to denote the quadtree cell (which is a hypercube of side length $2^{\ell(u)}$). Given a discrete measure $\mu$, $\mu(u)$ then denotes the total measure of points from the support of $\mu$ contained within this quadtree cell, namely, the total number of points (counted with multiplicity) within this quadtree cell. We can now map $\mu$ to a vector $v_\mu$ where each index corresponds to a tree node $u$, with coordinate value $v_\mu(u) = 2^{\ell(u)}\, \mu(u)$. Similarly, map $\nu$ to a vector $v_\nu$. Then Indyk and Thaper [IndykThaper2003] showed that $\|v_\mu - v_\nu\|_1$ gives an approximation to the optimal transport $\mathrm{OT}(\mu, \nu)$ in expectation.
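As a concrete illustration, here is a minimal sketch of this embedding for unit-mass point multisets, using a sparse dictionary in place of an explicit vector; it assumes the random shift and top level from the quadtree sketch above, and the names ot_embedding and l1_distance are ours.

    import numpy as np

    def ot_embedding(points, shift, top_level):
        """Count-vector embedding: coordinate (i, cell) stores 2**i times
        the number of points falling in that level-i grid cell."""
        vec = {}
        for i in range(0, top_level + 1):
            side = 2.0 ** i
            ids = np.floor((np.asarray(points, float) + shift) / side).astype(int)
            for cell in map(tuple, ids):
                vec[(i, cell)] = vec.get((i, cell), 0.0) + side
        return vec

    def l1_distance(u, v):
        """l1 distance between two sparse vectors stored as dicts."""
        return sum(abs(u.get(k, 0.0) - v.get(k, 0.0)) for k in set(u) | set(v))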

[[IndykThaper2003]] Given two discrete measures $\mu$ and $\nu$ with supports of total size $n$ and with $\mu(\mathbb{R}^d) = \nu(\mathbb{R}^d)$, using a randomly shifted quadtree, $\|v_\mu - v_\nu\|_1$ can be calculated in $O(n \log \Phi)$ time, and there are constants $c_1, c_2 > 0$ such that $c_1\, \mathrm{OT}(\mu, \nu) \le \mathbb{E}\big[\|v_\mu - v_\nu\|_1\big] \le c_2 \log \Phi \cdot \mathrm{OT}(\mu, \nu)$. Here, $\mathbb{E}$ stands for the expectation over the random shift.

We note that it also turns out that $\|v_\mu - v_\nu\|_1$ is exactly the optimal transport between $\mu$ and $\nu$ along the tree metric induced by $T$. Specifically, for each node $u$, set the weight of the edge to its parent to be $2^{\ell(u)}$. Then for any $p, q \in P$, define $d_T(p, q)$ to be the total weight of the unique tree path connecting the quadtree leaf containing $p$ and the leaf containing $q$. Then the optimal transport between $\mu$ and $\nu$ w.r.t. the metric $d_T$, denoted by $\mathrm{OT}_T(\mu, \nu)$, satisfies $\mathrm{OT}_T(\mu, \nu) = \|v_\mu - v_\nu\|_1$.

Furthermore, this optimal transport is generated by a greedy-flow $\Gamma_{greedy}$, constructed as follows: starting from the leaf nodes, we match up as much unmatched $\mu$-mass to unmatched $\nu$-mass as we can within each node $u$, and pass the remaining unmatched portion to its parent. In general, each tree node $u$ has a $\mu$-demand and a $\nu$-demand, which collect all unmatched measure from its child nodes. The $\mu$-demand (resp. $\nu$-demand) at a leaf node $u$ is initialized to be $\mu(u)$ (resp. $\nu(u)$). We then match these demands against each other as much as we can, and pass on to the parent whatever remains, which is either unmatched $\mu$-measure or unmatched $\nu$-measure. Note that a greedy-flow is not unique, but it turns out that any such greedy-flow (greedy transport plan) realizes the optimal transport distance between $\mu$ and $\nu$ w.r.t. the tree metric (see [Kalantari95] for more detail); i.e.,

$$\mathrm{OT}_T(\mu, \nu) = \sum_{p, q} \Gamma_{greedy}(p, q)\, d_T(p, q) = \|v_\mu - v_\nu\|_1. \qquad (4)$$
Approximation algorithm 2: Flowtree.

The flowtree algorithm of [Flowtree] is based on the previous approach. The only modification is the following: consider a greedy-flow $\Gamma_{greedy}$ as described above. Instead of using the tree metric to compute the optimal transport distance, the flowtree estimate computes the cost of this flow using the standard Euclidean distance:

$$\mathrm{FT}(\mu, \nu) = \sum_{p, q} \Gamma_{greedy}(p, q)\, \|p - q\|. \qquad (5)$$

Comparing Equation (4) to the above equation, the difference is minor ($d_T(p, q)$ versus $\|p - q\|$). However, in practice, $\mathrm{FT}(\mu, \nu)$ appears to provide a much more accurate estimate of the optimal transport distance w.r.t. the Euclidean distance. Unfortunately, unlike $\|v_\mu - v_\nu\|_1$, which can be computed as an $\ell_1$-distance between two specific vectors, to compute $\mathrm{FT}(\mu, \nu)$ we now have to compute a greedy-flow explicitly (which can be done in time linear in the size of the quadtree; however, conceptually, this is not as simple as an $\ell_1$-distance). Overall, we have the following result: [[Flowtree]] Given two discrete measures $\mu$ and $\nu$ with supports of total size $n$ and with $\mu(\mathbb{R}^d) = \nu(\mathbb{R}^d)$, using a randomly shifted quadtree, $\mathrm{FT}(\mu, \nu)$ can be computed in $O(n \log \Phi)$ time, and there are constants $c_1, c_2 > 0$ such that $c_1\, \mathrm{OT}(\mu, \nu) \le \mathbb{E}\big[\mathrm{FT}(\mu, \nu)\big] \le c_2 \log \Phi \cdot \mathrm{OT}(\mu, \nu)$. Here, $\mathbb{E}$ stands for the expectation over the random shift.
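For unit-mass multisets, a greedy-flow reduces to a greedy matching within nested grid cells, which makes the flowtree estimate easy to sketch. The following Python sketch (our own illustration, assuming equal-size input multisets and the random shift and top level from the quadtree sketch above) buckets the still-unmatched points level by level and charges the Euclidean distance for each pair matched inside a cell.

    import numpy as np

    def flowtree_estimate(A, B, shift, top_level):
        """Flowtree estimate FT(mu, nu) for two equal-size 2D multisets."""
        left = [np.asarray(a, float) for a in A]   # unmatched mu-points
        right = [np.asarray(b, float) for b in B]  # unmatched nu-points
        cost = 0.0
        for level in range(0, top_level + 1):
            side = 2.0 ** level
            cells = {}
            for p, bucket in [(p, 0) for p in left] + [(q, 1) for q in right]:
                key = tuple(np.floor((p + shift) / side).astype(int))
                cells.setdefault(key, ([], []))[bucket].append(p)
            left, right = [], []
            for mus, nus in cells.values():
                k = min(len(mus), len(nus))
                # greedily match k mu-points to k nu-points within this cell
                cost += sum(np.linalg.norm(a - b) for a, b in zip(mus[:k], nus[:k]))
                left.extend(mus[k:])   # unmatched mass passes to the parent
                right.extend(nus[k:])
        return cost

Since grid cells with the same shift nest across levels, re-bucketing the leftover points at each level reproduces the parent-passing of the greedy-flow; at the top level all points share one cell, so everything is matched.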

3 Approximating 1-Wasserstein distances for persistence diagrams

We now present two algorithms to approximate the 1-Wasserstein distance for persistence diagrams, based on the approximation schemes for optimal transport in Section 2.3. Note that the results here are developed for the Euclidean norm; through the equivalence of norms, they can be generalized to any $\ell_q$ norm, which only changes the constant factor in the distortion induced by each approximation.

3.1 Approximation algorithm via $\ell_1$-embedding

3.1.1 Description of the new quadtree-based $\ell_1$-embedding

Let $A$ and $B$ be two persistence diagrams and let $P = A \sqcup B$, the disjoint union of $A$ and $B$. In what follows, for simplicity of presentation, we assume that the minimum distance between any two distinct points in $P$, as well as between any point in $P$ and the diagonal $\Delta$, is at least 1. The latter constraint can be removed with some extra care in handling leaf nodes in the quadtree. Assume w.l.o.g. that the spread $\Phi$ of $P$ is a power of 2.

Partition the (randomly shifted) hypercube $H'$ described in Section 2.3 into grids whose cells have side length $2^i$ for $i = -1, 0, 1, \ldots, \log(2\Phi)$. Note that each cell at the lowest level can contain at most one point, and if a leaf contains a point then it cannot intersect the diagonal $\Delta$ (both by our minimum-distance assumption). Let $T$ be the resulting quadtree, where leaves are all cells that contain exactly one point from $P$. We further use $G_i$ to denote the set of quadtree cells with side length $2^i$ (i.e., those in level $i$); we refer to $G_i$ as the level-$i$ grid. Note that the size of the quadtree is $O(n \log \Phi)$. Additionally, we call a cell a terminal cell if it intersects the diagonal $\Delta$; otherwise, it is non-terminal.

Now for each grid $G_i$, construct a vector $v_i(A)$ with one coordinate per cell, where each coordinate counts the number of points of $A$ in the corresponding cell. The vector representation for $A$ is then the concatenation of all these vectors, each scaled by the cell side length $2^i$ of its grid $G_i$:

$$v(A) = \big(2^{-1} v_{-1}(A),\ 2^0 v_0(A),\ \ldots,\ 2^i v_i(A),\ \ldots\big).$$

Construct the vector $v(B)$ similarly. We use $v_c(A)$ to denote the value of coordinate $c$ in $v(A)$ and $v_c(B)$ to denote the value of coordinate $c$ in $v(B)$. Now we describe a modified-$\ell_1$ distance for these vectors, which is similar to the $\ell_1$ norm. To compute it, we define a per-coordinate difference $\delta_c$. There are two cases to consider:
Case 1: if coordinate $c$ is not associated with a terminal cell, then set $\delta_c = |v_c(A) - v_c(B)|$.
Case 2: if coordinate $c$ is associated with a terminal cell, then set $\delta_c = 0$.
Then we define

$$\|v(A) - v(B)\|_{\Delta} := \sum_{c} \delta_c. \qquad (6)$$
An equivalent $\ell_1$-distance formulation.

We introduced the above vector representation and the modified-$\ell_1$ distance as they are more convenient for the later theoretical analysis. However, algorithmically, we wish to have a true $\ell_1$-embedding. It turns out that an equivalent formulation is as follows: Let $\bar{G}_i \subseteq G_i$ denote the level-$i$ quadtree cells that do not intersect the diagonal $\Delta$. We then compute a vector representation $\bar{v}(A)$ (resp. $\bar{v}(B)$) restricted only to the cells in the $\bar{G}_i$'s. In other words, all entries corresponding to cells intersecting the diagonal are ignored in constructing $\bar{v}(A)$ and $\bar{v}(B)$. We then have that

$$\|v(A) - v(B)\|_{\Delta} = \|\bar{v}(A) - \bar{v}(B)\|_1. \qquad (7)$$

That is, our quadtree-induced distance is a true $\ell_1$-distance between suitably constructed vectors. Nevertheless, we use the definition in Equation (6) to simplify the proofs later.
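Below is a minimal sketch (our own, with hypothetical helper names) of the restricted embedding of Equation (7): it builds the per-level scaled count vector for one diagram while skipping terminal cells, so the modified distance of Equation (6) becomes a plain $\ell_1$-distance between two such sparse vectors (computable with the l1_distance helper from the sketch in Section 2.3).

    import numpy as np

    def cell_is_terminal(cx, cy, side, shift):
        """Does the level-i cell with index (cx, cy) intersect the diagonal?

        In original (unshifted) coordinates, the cell is the product of
        [cx*side - shift[0], (cx+1)*side - shift[0]) and the analogous
        y-interval; it meets the line y = x iff the two intervals overlap.
        """
        x0, x1 = cx * side - shift[0], (cx + 1) * side - shift[0]
        y0, y1 = cy * side - shift[1], (cy + 1) * side - shift[1]
        return x0 < y1 and y0 < x1

    def pd_embedding(diagram, shift, top_level):
        """Sparse vector of Equation (7): scaled counts over non-terminal cells."""
        vec = {}
        for i in range(-1, top_level + 1):
            side = 2.0 ** i
            for x, y in diagram:
                cx = int(np.floor((x + shift[0]) / side))
                cy = int(np.floor((y + shift[1]) / side))
                if cell_is_terminal(cx, cy, side, shift):
                    continue  # coordinates for terminal cells are dropped
                vec[(i, cx, cy)] = vec.get((i, cx, cy), 0.0) + side
        return vec

The estimate $\|v(A) - v(B)\|_{\Delta}$ is then l1_distance(pd_embedding(A, shift, L), pd_embedding(B, shift, L)) for a common shift and top level L.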

It is easy to see that the construction takes the same time as the $\ell_1$-embedding approach described in Section 2.3. We now show that this modified $\ell_1$-distance approximates the 1-Wasserstein distance for persistence diagrams. The main results for this $\ell_1$-embedding approach are summarized as follows, and we prove the approximation bound in Section 3.1.2. Given persistence diagrams $A$ and $B$ such that $|A| + |B| = n$, with $\Phi$ the spread of $P = A \sqcup B$, we can compute $\|v(A) - v(B)\|_{\Delta}$ in $O(n \log \Phi)$ time using the randomly shifted quadtree. Furthermore, the expected value of $\|v(A) - v(B)\|_{\Delta}$ is an $O(\log \Phi)$-approximation of the 1-Wasserstein distance $W_1(A, B)$; i.e., there are constants $c_1$ and $c_2$ such that $c_1\, W_1(A, B) \le \mathbb{E}\big[\|v(A) - v(B)\|_{\Delta}\big] \le c_2 \log \Phi \cdot W_1(A, B)$.

3.1.2 Approximation guarantees

Our approximation bound in Theorem 3.1.1 follows from Lemmas 1 and 2 below. To prove these lemmas, we first introduce a greedy augmented matching.

Greedy augmented matching $M_{greedy}$.

We construct the following augmented matching $M_{greedy}$ (recall Definition 2.1) in a bottom-up greedy manner: starting from the lowest level, we aim to match points of $A$ to points of $B$ as much as we can within each level. The remaining unmatched points are then considered at the next level up: in particular, within each non-terminal cell of $G_i$, we match as many of the so-far-unmatched points of $A$ as possible to unmatched points of $B$, and pass the remaining unmatched points (which can now only come from either $A$ or $B$, but not both) to its parent. Within a terminal cell of $G_i$, we match every unmatched point from $A$ and from $B$ to its closest point in the diagonal. Finally, at the root cell, any unpaired points from either $A$ or $B$ are paired to the diagonal, as the root is a terminal cell. Note that by construction, at any level $i$, the quantity $\sum_c \delta_c / 2^i$ (summing over the non-terminal cells $c \in G_i$) is exactly the number of points from $A \cup B$ that could not be matched at level $i$ or below under this greedy augmented matching and will subsequently need to be matched in a grid $G_j$, $j > i$. See Figure 1 for a simple illustration.

Figure 1: A simple example of the vector and quadtree representations of two persistence diagrams $A$ and $B$.

Lemma 1. There is a constant $c$ such that $W_1(A, B) \le c \cdot \|v(A) - v(B)\|_{\Delta}$.

Proof.

Consider the greedy augmented matching $M_{greedy}$ described above, induced by pairing points, or pairing points with their diagonal projections, greedily within the same cells of the grids in a bottom-up manner. First, given a pair $(x, y) \in M_{greedy}$, we say that $(x, y)$ is paired in level $i$ if the lowest-level quadtree cell containing both $x$ and $y$ is at level $i$; intuitively, $x$ and $y$ are paired greedily in this cell at level $i$. We use $M_i$ to denote the set of pairs from level $i$. For each level $i$, let $C_i$ be the total cost incurred by all the pairs in $M_i$; obviously,

$$\mathrm{cost}(M_{greedy}) = \sum_i C_i \ge W_1(A, B), \qquad (8)$$

where the inequality holds because $W_1(A, B)$ is the smallest total cost of any augmented matching, and the greedy augmented matching is an augmented matching.

There are no pairings induced in the lowest grid $G_{-1}$ (of cell side length $2^{-1}$), since the minimum inter-point distance is 1 and the minimum distance between a point and the diagonal is 1 as well, while the diameter of a level-$(-1)$ cell is $\sqrt{2}/2 < 1$. Hence $C_{-1} = 0$, and all points from $A$ and $B$ remain unpaired in level $-1$. More generally, every point that gets matched (to another point or to the diagonal) within a cell of $G_{i+1}$ was passed up unmatched from a non-terminal cell of $G_i$, and for each non-terminal cell $c \in G_i$ the number of such points is exactly $\delta_c / 2^i$ (points reaching a terminal cell are matched to the diagonal there and never passed up). As the maximal distance between any pair of matched points, or between a point and its diagonal projection, within a cell of $G_{i+1}$ is at most the cell diameter $\sqrt{2} \cdot 2^{i+1}$, the maximum cost incurred by all points matched in $G_{i+1}$ is

$$C_{i+1} \le \sqrt{2} \cdot 2^{i+1} \cdot \sum_{c \in G_i} \frac{\delta_c}{2^i} = 2\sqrt{2} \sum_{c \in G_i} \delta_c.$$

Hence, combining Equation (8), the total cost $\mathrm{cost}(M_{greedy})$ of the greedy augmented matching is bounded from above by

$$\mathrm{cost}(M_{greedy}) = \sum_i C_i \le 2\sqrt{2} \sum_i \sum_{c \in G_i} \delta_c = 2\sqrt{2}\, \|v(A) - v(B)\|_{\Delta}.$$

By the inequality of Equation (8), the claim then follows. ∎

Lemma 2. There is a constant $c'$ such that the expected value of $\|v(A) - v(B)\|_{\Delta}$ is bounded by $c' \log \Phi \cdot W_1(A, B)$.

Proof.

Set $A' = A \cup \pi(B)$ and $B' = B \cup \pi(A)$ as before. For a given grid $G_i$ and some coordinate $c$ in $v(A)$ and $v(B)$, let $v_c(A)$ be the value of coordinate $c$ in $v(A)$ and $v_c(B)$ be the value of coordinate $c$ in $v(B)$. Analogously, let $v(A')$ and $v(B')$ be the vector representations w.r.t. the multisets $A'$ and $B'$, respectively, and let $v_c(A')$ (resp. $v_c(B')$) be the value of coordinate $c$ in $v(A')$ (resp. $v(B')$). Note that $|A'| = |B'|$, so the discrete measures induced by $A'$ and $B'$ have equal total mass.

(i) Now, if $v_c(A) \neq v_c(A')$ or $v_c(B) \neq v_c(B')$, then there is at least one projected (diagonal) point in the cell associated with coordinate $c$. In other words, this cell must be a terminal cell, and $\delta_c = 0 \le |v_c(A') - v_c(B')|$.

(ii) Otherwise, if the conditions in (i) do not hold, it must be that $v_c(A) = v_c(A')$ and $v_c(B) = v_c(B')$, in which case we also have $\delta_c \le |v_c(A') - v_c(B')|$.

Combining (i) and (ii), we then have $\|v(A) - v(B)\|_{\Delta} \le \|v(A') - v(B')\|_1$.

On the other hand, by the definition of the tree metric $d_T$, we know that $\|v(A') - v(B')\|_1 = \mathrm{OT}_T(\mu_{A'}, \nu_{B'})$, where $\mu_{A'}$ and $\nu_{B'}$ are the discrete measures induced by $A'$ and $B'$, respectively. By Theorem 2.3, there is some constant $c$ such that $\mathbb{E}\big[\|v(A') - v(B')\|_1\big] \le c \log \Phi \cdot \mathrm{OT}(\mu_{A'}, \nu_{B'})$. From Observation 2.2, $\mathrm{OT}(\mu_{A'}, \nu_{B'}) \le 2\, W_1(A, B)$. Therefore,

$$\mathbb{E}\big[\|v(A) - v(B)\|_{\Delta}\big] \le \mathbb{E}\big[\|v(A') - v(B')\|_1\big] \le 2c \log \Phi \cdot W_1(A, B).$$

The lemma then follows. ∎

3.2 Approximation algorithm via flowtree

We now propose an alternative approximation algorithm for $W_1(A, B)$. The high-level idea is the same as in the flowtree algorithm described in Section 2.3: in particular, we first compute the optimal flow for points in $A$ and $B$ along the randomly shifted quadtree constructed earlier, but now with the modification that a point can be paired to the diagonal. It turns out that this leads to the same greedy augmented matching $M_{greedy}$ we described at the beginning of Section 3.1.2. Then, similarly to flowtree, under this greedy augmented matching $M_{greedy}$, we use the Euclidean distance between a pair of matched points (instead of the tree distance as for $\|v(A) - v(B)\|_{\Delta}$) to measure the cost of each pair of matched points. This leads to the following modified flowtree estimate:

$$\mathrm{FT}_{\Delta}(A, B) = \sum_{(x, y) \in M_{greedy}} \|x - y\|, \qquad (9)$$

and in our second algorithm, we use $\mathrm{FT}_{\Delta}(A, B)$ as an approximation of the true 1-Wasserstein distance $W_1(A, B)$.

From an implementation point of view, unlike $\|v(A) - v(B)\|_{\Delta}$, which can be computed as the $\ell_1$-distance between two vectors, we now must explicitly compute the greedy augmented matching $M_{greedy}$ between $A$ and $B$. (Note that this greedy augmented matching was only used in proving the approximation guarantee for $\|v(A) - v(B)\|_{\Delta}$, and was not needed for its computation.) Computing this greedy augmented matching (and calculating its cost) takes the same time as computing the greedy flow in the original flowtree algorithm of Section 2.3. Furthermore, it is easy to see that in the proof of Lemma 1 we in fact showed that $W_1(A, B) \le \mathrm{FT}_{\Delta}(A, B) = \mathrm{cost}(M_{greedy}) \le 2\sqrt{2}\, \|v(A) - v(B)\|_{\Delta}$ (see Eqn (8)). Combining this with Theorem 2.3, we thus obtain the following approximation result for the modified flowtree estimate.
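The following sketch (our own illustration, with hypothetical names, assuming the random shift and top level from the earlier quadtree sketch) computes the greedy augmented matching bottom-up and returns the estimate of Equation (9): within each cell, points of $A$ are matched to points of $B$, and in terminal cells the leftovers are matched to their diagonal projections.

    import numpy as np

    def pd_flowtree_estimate(A, B, shift, top_level):
        """Modified flowtree estimate FT_Delta(A, B) for persistence diagrams."""
        proj_cost = lambda p: abs(p[1] - p[0]) / np.sqrt(2.0)  # distance to diagonal

        def is_terminal(cx, cy, side):
            # same interval-overlap test as in the embedding sketch above
            x0, x1 = cx * side - shift[0], (cx + 1) * side - shift[0]
            y0, y1 = cy * side - shift[1], (cy + 1) * side - shift[1]
            return x0 < y1 and y0 < x1

        left = [np.asarray(a, float) for a in A]
        right = [np.asarray(b, float) for b in B]
        cost = 0.0
        for level in range(-1, top_level + 1):
            side = 2.0 ** level
            cells = {}
            for p, bucket in [(p, 0) for p in left] + [(q, 1) for q in right]:
                key = tuple(np.floor((p + shift) / side).astype(int))
                cells.setdefault(key, ([], []))[bucket].append(p)
            left, right = [], []
            for (cx, cy), (As, Bs) in cells.items():
                k = min(len(As), len(Bs))
                cost += sum(np.linalg.norm(a - b) for a, b in zip(As[:k], Bs[:k]))
                if is_terminal(cx, cy, side) or level == top_level:
                    # terminal cell: leftovers are matched to the diagonal
                    cost += sum(proj_cost(p) for p in As[k:] + Bs[k:])
                else:
                    left.extend(As[k:])   # pass leftovers up to the parent
                    right.extend(Bs[k:])
        return cost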

Given two persistence diagrams $A$ and $B$, where $|A| + |B| = n$ and $\Phi$ is the spread of the point set $A \cup B$, we can compute $\mathrm{FT}_{\Delta}(A, B)$ in $O(n \log \Phi)$ time using a randomly shifted quadtree. Additionally, the expected value of $\mathrm{FT}_{\Delta}(A, B)$ is an $O(\log \Phi)$-approximation of the 1-Wasserstein distance $W_1(A, B)$; i.e., there are constants $c_1$ and $c_2$ such that $c_1\, W_1(A, B) \le \mathbb{E}\big[\mathrm{FT}_{\Delta}(A, B)\big] \le c_2 \log \Phi \cdot W_1(A, B)$.

Remark. We remark that while these two approximation schemes, $\|v(A) - v(B)\|_{\Delta}$ and $\mathrm{FT}_{\Delta}(A, B)$, have similar approximation guarantees for $W_1(A, B)$, in practice the modified flowtree-based approach has much higher accuracy. This is consistent with the performance of the flowtree algorithm versus the $\ell_1$-embedding approach for the general optimal transport distance [Flowtree]. In contrast, the benefit of the embedding approach is that it is easy to compute. Also, each persistence diagram is now mapped to a vector representation, and the distance is the $\ell_1$-distance among these vector representations. One could combine this with methods such as locality-sensitive hashing for more efficient approximate $k$-nearest neighbor queries. In general, such an $\ell_1$-norm representation also makes the embedding potentially more suitable for downstream machine learning pipelines.

4 Experimental results

We evaluate both the runtime and the accuracy of the modified flowtree and $\ell_1$-embedding approaches against the Hera method of [Hera]. We use the implementation of Hera provided in the GUDHI library [gudhi] for testing. We run the experiments using an Intel Core i7-1065G7 CPU @ 1.30 GHz and 12.0 GB RAM. Additionally, the implementations of both the modified flowtree and the $\ell_1$-embedding were done in C++ (wrapped in Python for evaluation) and are based on the code provided in [Flowtree].

Datasets.

For our experiments, we use both synthetic persistence diagrams and persistence diagrams generated from real data. For synthetic data, we generate two sets of persistence diagrams, called “synthetic-uniform" and “synthetic-Gaussian", which are generated by a uniform sample and a sample w.r.t. a Gaussian distribution on the birth-death plane, respectively. For real datasets, we use persistence diagrams generated from the so-called Reddit datasets (a collection of graphs) [real_datasets], and from the ModelNet10 [ModelNet] dataset of shapes. Details of these datasets are in Appendix B.

Speed comparison.

We compare the running time of our new approximation algorithms with that of Hera [Hera]; note that we do not directly compare with the exact algorithm of Dionysus because, as reported in [Hera], Hera is 50 to 400 times faster than Dionysus. Note that Hera is also an approximation algorithm, and there is a parameter that adjusts its allowed relative error. By setting this parameter to be very small, we use the distance computed by Hera as the ground truth later when we measure approximation accuracy; see Table 1.

The comparison of the running times of our approaches with that of Hera (for a range of different approximation factors) can be found in Figure 2, which summarizes the runtime of each method on a log-scale using both randomly generated diagrams and the reddit-binary dataset (real persistence diagrams from graphs). We compare the speed of our modified flowtree and embedding against Hera with Hera's relative-error parameter set to 300000. However, note that as one relaxes this approximation parameter, the speed of Hera does not in fact improve much (as shown in Figure 2). Thus our speed gains remain no matter which parameter choice is used for Hera. Additionally, the true approximation error for Hera also does not change much; see Table 1. To get the approximation error for Hera, we find the true Wasserstein distance using the wasserstein_distance function from the GUDHI library, which uses the Python Optimal Transport library [flamary2017pot] and is based on ideas from [lacombe2018large]. The results in Figure 2 indicate that the modified flowtree approach is consistently faster than Hera, with the gap increasing as the size of the diagrams increases, and that the embedding approach is faster still. Both are orders of magnitude faster than Hera; the price to pay is that the approximation factor is worse for our approaches, as shown in Appendix C, Figure 3.

Approximation accuracy comparison.

To measure the accuracy of the approximations given by both the modified flowtree and the embedding approach, we first measure the average relative error and its standard deviation for both methods on all datasets using the $\ell_1$, $\ell_2$, and $\ell_\infty$ norms as the ground metrics. In particular, given a ground-truth distance $d$ and an approximate distance $\tilde{d}$, the relative error is $|\tilde{d} - d| / d$. As mentioned earlier, we use the output of Hera (with a very small error parameter) as the ground truth, and compare our approximated distances with it. The results are summarized in Figure 3, and a detailed table of the average relative errors and standard deviations is in Table 2. Overall, while our modified flowtree is slower than the $\ell_1$-embedding approach, it achieves much better approximation error. For our experiments, we generate a quadtree only once for all persistence diagrams in a given dataset and calculate the error of the approximated distances for that single quadtree. However, note that by constructing several quadtrees and averaging the distance estimates, or taking the smallest estimate, we could potentially reduce the approximation error.

In addition to the average error of our approximate distances, we also consider the efficacy of both methods in terms of nearest neighbor search and ranking accuracy. To evaluate nearest neighbor search, we first split the set of persistence diagrams into query diagrams and candidate diagrams. Then, we measure recall@$k$ accuracy, where recall@$k$ is defined as the fraction of queries that have their true nearest neighbor within the top-$k$ candidates returned by the evaluated method. The results are reported in Figure 4. For ranking accuracy, the detailed results are in Appendix C. To summarize, the modified flowtree approach is more accurate than the embedding approach, both in terms of nearest neighbor search and in terms of closeness to the true ranking of candidate diagrams for a fixed query diagram. Both approaches appear to have a lower degree of accuracy on diagrams with a higher proportion of points near the diagonal. This may be due to the increased possibility of erroneously matching points to the diagonal.
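For reference, this is a small sketch (our own helper, not from the paper's code) of how the recall@$k$ measure just described can be computed from precomputed query-to-candidate distance matrices.

    import numpy as np

    def recall_at_k(true_dists, approx_dists, k):
        """Fraction of queries whose true nearest neighbor appears among
        the top-k candidates ranked by the approximate distances.

        true_dists, approx_dists: (num_queries, num_candidates) arrays.
        """
        true_nn = np.argmin(true_dists, axis=1)
        top_k = np.argsort(approx_dists, axis=1)[:, :k]
        hits = [true_nn[i] in top_k[i] for i in range(len(true_nn))]
        return float(np.mean(hits))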

In summary, we note that both of our new approximation algorithms improve the speed of the previously best-known Hera algorithm by orders of magnitude, but with worse approximation factors. Empirically, the approximation factors remain roughly constant despite our theoretical results suggesting an $O(\log \Phi)$-approximation. In particular, the relative approximation error of flowtree is often smaller than 0.50 (see Appendix C, Table 2).

Maximum allowed relative error | Average relative error
0.1 | 0.00043768
1.0 | 0.003331
100000 | 0.0076055
200000 | 0.0076055
Table 1: Comparison of the maximum allowed relative error with the average observed relative error for Hera on the reddit-binary dataset. The relative error was calculated using the GUDHI library's wasserstein_distance function, which uses the Python Optimal Transport library to compute the exact 1-Wasserstein distance.
Dataset / measure | $\ell_1$ embd | $\ell_1$ ft | $\ell_2$ embd | $\ell_2$ ft | $\ell_\infty$ embd | $\ell_\infty$ ft
synthetic-uniform, Avg. Error | 2.058 | 0.2846 | 3.161 | 0.2664 | 4.536 | 0.2595
synthetic-uniform, Std. Dev. | 1.034 | 0.3891 | 1.189 | 0.3488 | 1.164 | 0.3176
synthetic-Gaussian, Avg. Error | 1.341 | 0.3358 | 2.136 | 0.2860 | 3.035 | 0.2251
synthetic-Gaussian, Std. Dev. | 0.3647 | 0.1809 | 0.4139 | 0.1519 | 0.3669 | 0.1178
reddit-binary, Avg. Error | 2.112 | 0.2899 | 3.089 | 0.3080 | 3.921 | 0.2854
reddit-binary, Std. Dev. | 1.275 | 0.3859 | 2.100 | 0.5126 | 2.427 | 0.4801
ModelNet10, Avg. Error | 2.189 | 0.7331 | 2.438 | 0.4929 | 3.061 | 0.9399
ModelNet10, Std. Dev. | 0.9543 | 0.4132 | 0.8136 | 0.3171 | 1.051 | 0.4448
Table 2: Average relative error and standard deviation for all datasets over the three ground metrics. We abbreviate the embedding approach as embd and the flowtree approach as ft.
Figure 2: Comparison of the runtimes of Hera, flowtree, and embedding using both generated and real data (panels: synthetic-uniform and reddit-binary).
Figure 3: Relative error of the flowtree and quadtree (embedding) approximations over the $\ell_1$, $\ell_2$, and $\ell_\infty$ ground metrics for all datasets (panels: reddit-binary, ModelNet10, synthetic-uniform, synthetic-Gaussian).
Figure 4: Recall@$k$ accuracy on the reddit-binary, ModelNet10, synthetic-uniform, and synthetic-Gaussian datasets.

5 Concluding remarks

In this paper, we presented two algorithms for fast approximation of the 1-Wasserstein distance between persistence diagrams, based on quadtree embeddings. While the relative error incurred by both algorithms is higher than that of Hera, their runtime is significantly faster. We also observe that the approximation methods introduced here are more accurate on persistence diagrams with a lower proportion of points near the diagonal.

In the future, we are interested in combining the described embedding with locality-sensitive hashing for sub-linear nearest neighbor search. Additionally, it may be possible to use these ideas to compare persistence diagrams under certain transformations, e.g., parallel shifts along the diagonal direction (which correspond to adding a constant term to the input function generating the persistence diagram). It will also be interesting to extend this work to perform statistics on the space of persistence diagrams (e.g., computing the 1-mean of persistence diagrams under our approximate distances).

References

Appendix A Additional Proofs

Proof of Observation 2.2.

By the optimality of $\mathrm{OT}(\mu_{A'}, \nu_{B'})$, we know that

$$\mathrm{OT}(\mu_{A'}, \nu_{B'}) \le \mathrm{cost}(M^*) + \sum_{(a, b) \in M_1} \|\pi(a) - \pi(b)\|,$$

where $M^*$ is an optimal augmented matching realizing $W_1(A, B)$ and $M_1 \subseteq M^*$ is the set of pairs of $M^*$ that have the first form given in Definition 2.1. (Indeed, sending mass along the pairs of $M^*$, and additionally from $\pi(b)$ to $\pi(a)$ for each pair $(a, b) \in M_1$, yields a feasible transport plan from $\mu_{A'}$ to $\nu_{B'}$.) We know that $\|\pi(a) - \pi(b)\| \le \|a - b\|$, so $\mathrm{OT}(\mu_{A'}, \nu_{B'}) \le 2\, W_1(A, B)$. ∎

Appendix B Datasets

We use datasets of synthetic persistence diagrams as well as persistence diagrams generated from real data. For persistence diagrams from real data, we use both graph and shape datasets.

  • Synthetic data: We generated two sets of persistence diagrams. For the first set, we create a random persistence diagram of bounded size by generating points $(b_i, d_i)$: we sample the birth-time $b_i$ from a uniform distribution over a fixed interval, and then sample the death-time $d_i$ from a uniform distribution between $b_i$ and the upper end of that interval. We refer to this set of synthetic persistence diagrams as synthetic-uniform. For the second set, we again generate each point by sampling the birth-time $b_i$ from a uniform distribution over a fixed interval; we then sample the death-time $d_i$ from a Gaussian distribution centered about $b_i$ with standard deviation 1.0. We refer to this second set of synthetic diagrams as synthetic-Gaussian.

  • Graphs: For persistence diagrams generated from graphs, we use the reddit-binary graph dataset, which consists of graphs corresponding to online discussion on Reddit. In each graph, nodes correspond to users and there is an edge between two nodes if at least one of the corresponding users has responded to the other’s comments. The data is taken from four popular subreddits: IAmA, AskReddit, TrollXChromosomes, and atheism. Additionally, the persistence diagrams are generated using node degree as the filtration function [real_datasets].

  • Shapes: We use the ModelNet10 [ModelNet] dataset to generate persistence diagrams from shapes. ModelNet10 comprises 4899 CAD models from 10 object categories. The persistence diagrams are generated using closeness centrality as the filtration function [real_datasets].

The statistics of the ModelNet10 and reddit-binary datasets are summarized in Table 3. Note that a persistence point is considered close to the diagonal if its lifetime is less than one-tenth of the lifetime of the point with the largest lifetime.

Dataset | Average # of PD points | Average # of points near diagonal
reddit-binary | 278.155 | 20.245
ModelNet10 | 40.52 | 34.345
Table 3: Diagram statistics for the reddit-binary and ModelNet10 datasets.

Appendix C Results

To measure the accuracy of the rankings produced by the modified embedding and flowtree methods, we plot the true rank of each candidate against its rank in the ordering produced by the evaluated method. Note that we do this using a single fixed norm as the ground metric. The ranking accuracies of the evaluated methods for the reddit-binary, ModelNet10, and synthetic datasets are summarized in Figures 5, 6, 7, and 8. The average number of ranks away from the true rank is less than 10 for reddit-binary and synthetic-uniform, whereas it is above ten for synthetic-Gaussian and ModelNet10.

Both the modified flowtree and the embedding approximations seem to be less effective for estimating nearest neighbors for persistence diagrams with a high proportion of points near the diagonal. This may be because a higher proportion of points near the diagonal increases the possibility of erroneously matching points to the diagonal.

Figures 5–8 panels: (a) flowtree rankings; (b) embedding rankings.