# Approximate Nearest Neighbors in the Space of Persistence Diagrams

Persistence diagrams are important tools in topological data analysis that describe the presence and magnitude of features in a filtered topological space. However, current approaches for comparing a persistence diagram to a set of other persistence diagrams are linear in the number of diagrams or do not offer performance guarantees. In this paper, we apply concepts from locality-sensitive hashing to support approximate nearest neighbor search in the space of persistence diagrams. Given a set Γ of n (M,m)-bounded persistence diagrams, each with at most m points, we snap-round the points of each diagram to points on a cubical lattice and produce a key for each possible snap-rounding. Specifically, we fix a grid over each diagram at several resolutions and consider the snap-roundings of each diagram to the four nearest lattice points. Then, we propose a data structure with τ levels that stores all snap-roundings of each persistence diagram in Γ at each resolution. This data structure has size O(n5^mτ) to account for varying lattice resolutions as well as snap-roundings and the deletion of points with low persistence. To search for a persistence diagram, we compute a key for a query diagram by snapping each point to a lattice and deleting points of low persistence. Furthermore, as the lattice parameter decreases, searching our data structure yields a six-approximation of the nearest diagram in Γ in O((m log n + m^2) log τ) time and a constant factor approximation of the kth nearest diagram in O((m log n + m^2 + k) log τ) time.

## 1 Introduction

Computational topology is a field at the intersection of mathematics (algebraic topology) and computer science (algorithms and computational geometry). In recent years, the use of techniques from computational topology in application domains has been on the rise [1, 12, 22]. Furthermore, persistence diagrams can be used to reconstruct different types of simplicial complexes, which can be used to represent geometric objects and point clouds [26, 3]. These results provide avenues to explore object classification and recognition in new and enlightening ways. More generally, current research is applying techniques from computational topology to big data. However, computing distances within a set of persistence diagrams using a linear number of computations per diagram can be computationally expensive. To address the expense, preliminary work by Kerber and Nigmetov [20] looked at understanding the space of persistence diagrams by building a cover tree of a set of diagrams. To reduce the complexity of comparing a query diagram to a set of diagrams, Fabio and Ferri represent persistence diagrams as complex polynomials and compare the diagrams using complex vectors that store the coefficients of the polynomials [7]. The research in [7], however, is experimental and offers no performance guarantees on the distance between two diagrams deemed to be close to one another by comparing the complex vectors. We address a similar problem, answering near neighbor queries in the space of persistence diagrams, and provide a means of querying for near diagrams with performance guarantees.

Nearest neighbor search is a fundamental problem in computer science, arising in databases, data mining, and information retrieval, among other areas. The problem was posed in 1969 by Minsky and Papert [24]. For data in low-dimensional space, the problem is well-solved by first computing the Voronoi diagram of the data points as the underlying search structure and then performing point location queries for query points [9]. When the dimension is large, such a method is known to be impractical, as the query time typically has a constant factor that is exponential in the dimension (known as the “curse of dimensionality”) [5]. Researchers then resort to approximate nearest neighbor search [2].

In many applications, for approximate nearest neighbor queries, the data in consideration are not necessarily (high-dimensional) points in Euclidean space. In 2002, Indyk considered data consisting of a set of polygonal curves, each with a bounded number of vertices, where the distance between two curves is the discrete Fréchet distance; he built a data structure of exponential size that answers approximate nearest neighbor queries [17]. Most recently, Driemel and Silvestri used locality-sensitive hashing to answer near neighbor queries within a constant factor, with time and space bounds that are practical only for a restricted range of parameters [8]. In the case of persistence diagrams under the bottleneck distance, no research currently exists on how to find the nearest neighbor or even a near neighbor efficiently while offering performance guarantees.

#### Our Contributions

We study the near neighbor search and the k-near neighbor problems in the space of persistence diagrams under the bottleneck distance. We present the first solution, with performance guarantees, to the problem of searching in the space of persistence diagrams beyond explicitly computing pairwise distances. To address this problem, we develop a key function, parameterized by a resolution, so that two persistence diagrams can be compared using their keys. These keys demonstrate a hierarchical structure: if the bottleneck distance between two diagrams is small relative to the resolution, then their keys agree; and if their keys agree, then their bottleneck distance is at most a constant multiple of the resolution. (In what follows, we will see that one of the diagrams will actually have multiple keys; we simplify the notation for this introductory paragraph.) These results extend the work conducted by Driemel and Silvestri, who use a hash function to search in the space of curves under the discrete Fréchet distance [8]; however, this extension is nontrivial. Additional care must be taken when considering persistence diagrams, as points may be mapped to the diagonal when computing the bottleneck distance. As such, the crux of this research is managing points near the diagonal when generating keys for each diagram.

More formally, we summarize our results as follows. Given a set Γ of n (M,m)-bounded persistence diagrams, we propose a key function that produces a set of keys for each diagram in Γ, one for each of its snap-roundings to lattices at several resolutions. The keys are stored in a data structure with τ levels, using O(n5^mτ) space. This data structure is capable of answering queries of the form: given a query persistence diagram Q with at most m points, return a six-approximation of the nearest neighbor to Q in Γ in O((m log n + m^2) log τ) time, or return k diagrams, each of which is a constant-factor approximation of the kth nearest diagram, in O((m log n + m^2 + k) log τ) time. This data structure and algorithm are the first attempt at efficient searching in the space of persistence diagrams with performance guarantees.

## 2 Preliminaries

In this section, we give necessary definitions for persistence diagrams, bottleneck distance and additional concepts used throughout this paper. We assume that the readers are familiar with the basics of algorithms [6].

### 2.1 Persistence Diagrams and Bottleneck Distance

Homology is a tool from algebraic topology that describes the so-called holes in topological spaces by assigning the space an abelian group for each dimension. When we are not given an exact topological space, but an estimate of it, we need to introduce some notion of scale. If each scale parameter is assigned a topological space such that the space changes nicely with the parameter, then we track these changes using persistent homology. For further details, readers are referred to [14, 25] for classical homology theory and [10] for persistent homology. In this paper, we are working in the space of persistence diagrams under the bottleneck distance (which we will make more precise next), and are not concerned with where these diagrams came from.

Persistent homology tracks the birth and death of the topological features (i.e., the connected components, tunnels, and higher-dimensional ‘holes’) at multiple scales. A persistence diagram summarizes this information by representing the birth and death times of homology generators as points (b, d) in the extended plane. We comment that we could also represent a persistence diagram as a set of half-open intervals (barcodes) of the form [b, d), as in [27, 4]. We focus on the former representation and lay a grid over the diagram (see Section 2.2). Let D denote the diagonal (the line b = d), taken with infinite multiplicity. Notice that points with small persistence are close to the diagonal and points with high persistence are far from the diagonal; in particular, under the ℓ∞ norm, the point (b, d) has distance (d − b)/2 from D.

Given persistence diagrams and , the bottleneck distance between them is defined as:

$$d_B(P,Q) = \inf_{\phi} \sup_{p \in P} \lVert p - \phi(p) \rVert_\infty,$$

where the infimum is taken over all bijections ϕ from P to Q. Notice that such an infimum exists, since ∥p − ϕ(p)∥∞ is nonnegative and there exists at least one bijection with finite bottleneck distance (namely, the one that matches every off-diagonal point of either diagram to its projection on the diagonal, and every remaining diagonal point to itself). Here, the projection of a point is its orthogonal projection to the closest point on the diagonal. We define a matching between P and Q to be a set of edges such that no point in P or Q appears more than once. We interpret these edges as pairing each off-diagonal point of P with either an off-diagonal point of Q or the diagonal, and each off-diagonal point of Q with either an off-diagonal point of P or the diagonal. Furthermore, a matching is perfect if every off-diagonal point of P and Q is matched, i.e., every point is paired with the diagonal or a point from the other diagram and every point has degree one. Consequently, if the bottleneck distance between P and Q is at most some value ε, then a perfect matching between the off-diagonal points of P and Q exists such that the length of each edge is at most ε. In this light, computing the bottleneck distance is equivalent to finding a perfect matching between P and Q that minimizes the length of the longest edge; see [10, §VII.4] and [19], which use results from graph matching [23, 21, 11, 16].
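To make the matching formulation concrete, the following is a minimal brute-force sketch in Python; it is not the algorithm of [10] or [19]. It assumes finite diagrams given as lists of (birth, death) pairs, and the function names are hypothetical. Each diagram is augmented with one diagonal slot per point of the other diagram, and every perfect matching is tried, so the sketch is only usable for a handful of points.

```python
from itertools import permutations

def linf(p, q):
    """L-infinity distance between two points of the plane."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def to_diagonal(p):
    """L-infinity distance from p = (birth, death) to the diagonal."""
    return abs(p[1] - p[0]) / 2.0

def bottleneck_bruteforce(P, Q):
    """Bottleneck distance between two small diagrams by exhaustive matching.

    Illustrative only: the running time is factorial in len(P) + len(Q).
    """
    # Left side: points of P, then one diagonal slot per point of Q.
    left = [("pt", p) for p in P] + [("diag", None)] * len(Q)
    # Right side: points of Q, then one diagonal slot per point of P.
    right = [("pt", q) for q in Q] + [("diag", None)] * len(P)

    def cost(a, b):
        if a[0] == "pt" and b[0] == "pt":
            return linf(a[1], b[1])          # off-diagonal to off-diagonal
        if a[0] == "pt":
            return to_diagonal(a[1])         # point of P sent to the diagonal
        if b[0] == "pt":
            return to_diagonal(b[1])         # point of Q sent to the diagonal
        return 0.0                           # diagonal matched to diagonal

    best = float("inf")
    for perm in permutations(range(len(right))):
        longest = max((cost(left[i], right[j]) for i, j in enumerate(perm)),
                      default=0.0)
        best = min(best, longest)
    return best
```

For example, with P = [(0, 10), (3, 4)] and Q = [(1, 10)], the best matching pairs (0, 10) with (1, 10) and sends (3, 4) to the diagonal, so the longest edge has length 1.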

The space of persistence diagrams under the bottleneck distance is, in fact, a metric space. In what follows, we consider diagrams with at most m off-diagonal points, each of whose coordinates is bounded in terms of M. We call such persistence diagrams (M,m)-bounded. Throughout this paper, we work in the (metric) space of (M,m)-bounded persistence diagrams under the bottleneck distance.

### 2.2 Lattices and Grids

A lattice in is an arrangement of points in a regular structure. A cubical lattice in  is a uniform scaling of the set of all integer-coordinates. In this paper, we consider a cubical lattice bounded by , where we only include coordinates such that . The lattice parameter of the cubical lattice, denoted in this paper, is the minimum distance between two distinct lattice points. In particular, we define to be the lattice centered in the square with such that the extreme lattice points are at exactly from the corners of . The complexity of the lattice is the number of lattice points: .

We think of the cubical lattice as defined by a regular grid: the decomposition of the plane into points (vertices or lattice points), line segments (edges), and two-cells (squares) obtained by placing vertical and horizontal lines through the lattice points, with adjacent lines a constant distance apart (in this case, the lattice parameter δ apart). These squares are what we call grid cells in what follows. For simplicity of exposition, we assume that no input points lie on either grid lines or on edges in the Voronoi diagram of the lattice points. Thus, nearest neighbors in the lattice are unique, and every point has exactly four lattice points defining the grid cell containing it.

## 3 Generating and Searching Persistence Keys

In this section, we define a key function that maps a persistence diagram in to a vector in . The exponent is the complexity of the lattice, hence as increases, the keys become more discerning. We order the diagrams using the dictionary order on , and store the keys in a multilevel data structure that supports binary search. We note here that the hierarchical lattice is adapted from approaches to locality-sensitive hashing [18, 13, 15].

Let and . Let be a persistence diagram. We consider the lattice . We then snap each off-diagonal point to a lattice point and count the multiplicity for each lattice point. We note that while our key function was inspired by the hash function of [8] that ignored multiplicities, we must count the multiplicity of duplicated lattice points.

Recall from lattices that the lattice parameter is . We define our key function  by:

$$\operatorname{key}\bigl(P, \mathbb{L}(M,\eta)\bigr) = \sum_{p \in \overline{P}} e_{\operatorname{nn}(p)},$$

where the key function takes a persistence diagram and a cubical lattice as input, the sum ranges over the off-diagonal points p of P, nn(p) is the index of the lattice point nearest to p, and e_j is the j-th standard basis vector, whose dimension is the complexity of the lattice. For simplicity of the proofs to follow, and so that the key is well-defined, we assume that no persistence point lies on a Voronoi edge of the lattice. Of course, since many coordinates of the key are zero, we store it using a sparse vector representation; moreover, for the empty diagram (i.e., with no point but the diagonal), the key is the zero vector.
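The key function can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the exact construction above: the lattice is taken to be all integer multiples of the lattice parameter `delta` (ignoring the centering and the bound M), a lattice point is identified by its integer index pair, and the sparse key is stored as a sorted tuple of (lattice index, multiplicity) pairs. The names `nearest_lattice_index` and `key` are hypothetical.

```python
from collections import Counter
import math

def nearest_lattice_index(p, delta):
    """Index (i, j) of the lattice point (i*delta, j*delta) nearest to p.

    Ties are assumed not to occur (no point lies on a Voronoi edge).
    """
    return (math.floor(p[0] / delta + 0.5), math.floor(p[1] / delta + 0.5))

def key(diagram, delta):
    """Sparse key: multiplicity of each lattice point hit by a snapped point."""
    counts = Counter(nearest_lattice_index(p, delta) for p in diagram)
    # Sorting gives a canonical form, so equal multisets give equal keys
    # and keys can be compared in dictionary order.
    return tuple(sorted(counts.items()))
```

Two points that snap to the same lattice point contribute a multiplicity of two, which is the case the hash function of [8] did not need to track. The empty diagram yields the empty tuple, the sparse form of the zero vector.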

This vector could also be encoded as a product of prime powers, using an ordered set of prime numbers with one prime per lattice point and the corresponding coordinate of the vector as its exponent. Then, we have a unique integer for each vector. Doing so would put us in the more conventional setting, where we have indices into a hash table instead of keys. However, we would then either need to have a pre-generated list of primes (which adds to our storage space) or must account for computing the primes (which adds time complexity).

Suppose we snap-round a diagram before applying the key function; a natural choice for rounding each point would be to one of the four lattice points defining the grid cell containing it. Formally: [Snap Sets and Canonical Ordering] Given a lattice with lattice parameter δ and a diagram P, the first snap set is the set of all possible snap-roundings of P obtained by allowing each off-diagonal point of P to snap-round to one of the lattice points within distance δ of it, i.e., one of the nearest four lattice points. The second snap set is the set of all snap-roundings of P obtained by additionally allowing points at distance less than or equal to δ from the diagonal to be optionally deleted.

Each point that does not lie on a grid line can snap to any of the four lattice points defining the grid cell containing it. Then, we bound the number of snap-roundings for a fixed diagram:

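[Enumerating Keys] If P is an (M,m)-bounded persistence diagram and the cubical lattice is as defined in Section 2.2, then the numbers of keys in the two snap sets of P have the following upper bounds:

• The snap set without deletions has size at most 4^m.

• The snap set with optional deletions has size at most 5^m.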
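The counting argument can be checked on a small example. The sketch below enumerates the keys of all snap-roundings of a diagram, assuming (as a simplification) that the lattice consists of all integer multiples of `delta` and that a point may be deleted when its ℓ∞ distance to the diagonal is at most `delta`. The helper names are hypothetical.

```python
from collections import Counter
from itertools import product
import math

def diagonal_distance(p):
    """L-infinity distance from p = (birth, death) to the diagonal."""
    return abs(p[1] - p[0]) / 2.0

def snap_options(p, delta, allow_delete):
    """The four corners of the grid cell containing p, plus optional deletion."""
    i, j = math.floor(p[0] / delta), math.floor(p[1] / delta)
    options = [(i, j), (i + 1, j), (i, j + 1), (i + 1, j + 1)]
    if allow_delete and diagonal_distance(p) <= delta:
        options.append(None)  # None stands for "delete this point"
    return options

def snap_keys(diagram, delta, allow_delete=False):
    """Set of distinct keys over all snap-roundings of the diagram."""
    per_point = [snap_options(p, delta, allow_delete) for p in diagram]
    keys = set()
    for choice in product(*per_point):  # at most 4^m or 5^m choices
        counts = Counter(c for c in choice if c is not None)
        keys.add(tuple(sorted(counts.items())))
    return keys
```

For the two-point diagram [(0.3, 2.6), (1.2, 1.7)] with `delta = 1.0`, only the second point is close enough to the diagonal to be deleted, so there are 4 · 4 = 16 keys without deletions and 4 · 5 = 20 keys with deletions, both within the bounds 4^2 and 5^2.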

We are now almost ready to prove the Collisions between Diagrams theorem, which shows that a query diagram collides with a snapping of a diagram in Γ if and only if the two diagrams are close. (Note that this ‘if and only if’ statement uses asymmetric notions of close.) We first prove a simplified version, the Collision without Diagonal Interference lemma, where we consider the perfect matching problem for points in the extended plane, without the diagonal. In this case, the proof is made easier as the two diagrams necessarily have the same number of points (as otherwise, a perfect matching is not possible). Then, to prove the theorem, we delete points in the query diagram that are less than or equal to distance δ/2 from the diagonal and observe that some of these points could have been matched with low-persistence off-diagonal points of the other diagram.

[Collision without Diagonal Interference] Let be finite -bounded point clouds. Let and consider the lattice . Let denote the lattice parameter. Then,

1. if such that , then ;

2. if , then such that .

###### Proof.

Recall from lattices that and .

For the first part, let such that . Then, we construct a matching between  and iteratively by peeling off pairs that are mapped to the same lattice point . By enumerating, we know that was snap-rounded to one of the four lattice points within distance of . Additionally, we know that was snap-rounded to the closest lattice point. Hence, we have and . Thus, by the triangle inequality, we have ; an example of this situation is shown in onecorneronecorner-best. Therefore, if , we have .

To prove the second part, we assume that . Then, let be the perfect matching that realizes the bottleneck distance. For each pair , let be the lattice point to which  snaps in , as illustrated in onecorneronecorner-worst. Then, because it is the nearest point to . Since , we know that and by the triangle inequality. Thus, a snap-rounding of exists such that each  is snapped to ; we denote this snap-rounding . Since and each pair in share a lattice point in and , we conclude that . ∎

The above lemma is restricted to matchings between off-diagonal points. Next, we generalize the lemma by allowing matchings to the diagonal. This is the central theorem of this paper.

[Collisions between Diagrams] Let . Let  and consider the lattice . Let denote the lattice parameter, and let be the diagram obtained from by removing all points less than or equal to distance from the diagonal. Then:

1. If such that , then .

2. If , then such that .

###### Proof.

Again, recall from lattices that and .

We start with the first part. Let such that . Then, each off-diagonal persistence point is either snap-rounded to one of its neighbors within distance  or deleted in order to obtain . Then, we construct a matching  between  and  from boundedcollision Part 1 with bottleneck cost at most ; next, we add to the matching in order to extend the matching to a perfect matching between and by considering unmatched vertices of  followed by unmatched vertices of . Notice that all deleted vertices of  are within  of the diagonal by construction. Furthermore, all vertices in  must have been deleted by , and hence are within  of the diagonal. We add all of these pairs (i.e., between and the diagonal, and between and the diagonal) to , thus obtaining a perfect matching between  and with bottleneck cost at most .

We now prove the second part. Assume that . Let be a perfect matching that realizes the bottleneck distance between and . We use this matching to construct a diagram by choosing which points of to snap-round to lattice points and which points to delete; see noisydgm for an example of the points that are eligible for removal during the snap-rounding. For each with , we know that (recall that denotes the diagonal) since . Letting be the closest lattice point to (just as we did in boundedcollision), we snap-round to in . On the other hand, notice that for all with (i.e., ), we know that

$$d_\infty(p, D) \le d_\infty(p, q) + d_\infty(q, D) \le d_B(P,Q) + \tfrac{1}{2}\delta \le \delta,$$

and so we choose to delete ( is deleted by the definition of ). Therefore, we have constructed a diagram such that . ∎

This result implies that diagrams with a small bottleneck distance relative to the chosen δ value will be hashed together, while diagrams with a large bottleneck distance, relative to δ, will not. Next, using the Collisions between Diagrams theorem, we discuss a multi-level data structure that, for some query diagram Q, supports searching for approximate nearest neighbors in Γ.

## 4 Determining Approximate Nearest Neighbors

In the Collisions between Diagrams theorem, we saw that for a query diagram Q and scale δ, Q will share a key with some diagram P ‘if and only if’ they are close, with respect to the chosen scale δ. To find the near-neighbor, we must select a δ with the correct relationship to the bottleneck distance between P and Q. The relationship presents two problems. First, how do we determine the correct value for δ? Second, a single δ value would rarely be sufficient for all queries.

In this section, we build a multi-level data structure to support approximate nearest neighbor queries in the space of persistence diagrams. Each level of the data structure corresponds to a lattice with a different resolution. In the previous section, we needed a flexible notion for and , but in this section, the data structure level and lattice are dependent. So, we simplify notation. Recall that, as our persistence diagrams are all -bounded (i.e., in ), all points lie in . For , we define

• , that is, the lattice parameter for

[Data Structure]

Let be finite, let , and let be the minimum bottleneck distance between any two diagrams in . Let , then for each integer , we define  to be the data structure that stores the sorted list of keys , for each and . With each key, we store a list of persistence diagrams from which have a snap-rounding to that key, and the number of diagrams from which snap to the key. We note that a diagram with a given key can be found in time logarithmic in the number of distinct keys at that level. We denote the array of the multi-level data structure as . We can access a given level in constant time.

In the definition above, the choice of the number of levels provides a point at which the diagrams with the smallest bottleneck distance stop colliding, so we can stop considering smaller values of δ. In particular, the choice relies on the contrapositive of Part 1 of the Collisions between Diagrams theorem, which implies that if the bottleneck distance between two diagrams exceeds (3/2)δ, then no snap-rounding of one collides with the key of the other. Thus, we can guarantee that Q will share a key with a representative of P, for each P close enough to Q.

For each level, each diagram has at most 5^m snap-roundings and keys, each of which can be generated in O(m) time. Comparing two keys to determine their relative order requires O(m) time.

For each diagram at each level, we can determine the set of unique keys by sorting the keys and removing duplicates. After this per-diagram step for all n diagrams, we sort the keys at the given level and create a list of diagrams for each unique key; the latter operation is asymptotically cheaper than sorting the keys. Repeating this for all τ levels generates the data structure. Next, we consider some properties of the data structure, specifically, that collisions on a level with a fine resolution imply collisions between the same diagrams on levels with coarser resolutions. To simplify notation, for a diagram Q and a level i, we consider the diagram obtained from Q by removing all points less than or equal to distance ½δ_i of the diagonal.
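The construction can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact structure defined above: the lattice at level i is all integer multiples of δ_i, the lattice parameter halves from one level to the next (δ_i = δ_0 / 2^i), stored diagrams may delete points within δ_i of the diagonal, and the query deletes points within δ_i / 2. All function names are hypothetical.

```python
from bisect import bisect_left
from collections import Counter
from itertools import product
import math

def diag_dist(p):
    """L-infinity distance from p = (birth, death) to the diagonal."""
    return abs(p[1] - p[0]) / 2.0

def to_key(lattice_points):
    """Canonical sparse key: sorted (lattice index, multiplicity) pairs."""
    return tuple(sorted(Counter(lattice_points).items()))

def snap_keys(diagram, delta):
    """Keys of all snap-roundings (four cell corners, optional deletion)."""
    options = []
    for (b, d) in diagram:
        i, j = math.floor(b / delta), math.floor(d / delta)
        opts = [(i, j), (i + 1, j), (i, j + 1), (i + 1, j + 1)]
        if diag_dist((b, d)) <= delta:
            opts.append(None)  # point may be deleted
        options.append(opts)
    return {to_key([c for c in choice if c is not None])
            for choice in product(*options)}

def query_key(diagram, delta):
    """Single key of a query: drop low persistence, snap to nearest point."""
    kept = [p for p in diagram if diag_dist(p) > delta / 2.0]
    return to_key([(math.floor(b / delta + 0.5), math.floor(d / delta + 0.5))
                   for (b, d) in kept])

def build_levels(diagrams, delta0, tau):
    """One sorted key list per level, each key with its list of diagram ids."""
    levels = []
    for i in range(tau):
        delta = delta0 / 2 ** i  # assumed halving of the lattice parameter
        buckets = {}
        for idx, dgm in enumerate(diagrams):
            for k in snap_keys(dgm, delta):
                buckets.setdefault(k, []).append(idx)
        keys = sorted(buckets)  # dictionary order on keys
        levels.append((keys, [buckets[k] for k in keys]))
    return levels

def lookup(level, key):
    """Binary search for a key within one level; returns matching ids."""
    keys, lists = level
    pos = bisect_left(keys, key)
    return lists[pos] if pos < len(keys) and keys[pos] == key else []
```

For instance, with `levels = build_levels([[(0.2, 3.1)], [(0.2, 7.3)]], 1.0, 3)`, the call `lookup(levels[0], query_key([(0.3, 3.05)], 1.0))` returns the id of the first diagram only. Storing the number of diagrams per key, as the definition does, amounts to taking the length of each list.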

[Hierarchical Collision] Let be finite. Let and . Let and let and be the diagrams obtained from by removing all points less than or equal to distance  (resp., ) of the diagonal. Suppose there exists such that  (i.e.,  and collide in level ), then for any , there exists such that .

###### Proof.

It suffices to prove that if and collide in  then and  collide in . From collisionsnapImpliesBound, since and , a perfect matching  exists between and such that , . We use this matching to find such that , which happens when there exists a perfect matching such that for each , either , where is the closest lattice point to , or . Then, we show that we can construct . We begin with , and for each we construct one, or more, edges and add them to . To construct each edge, we consider three cases: Case , where ; Case , where ; and Case , where neither nor are in , which has two subcases.

Case (): Since and , then is within of . Since , we know that lies at most from and can be matched with in a matching corresponding to a collision at as well. Thus, we add to .

Case (): Since and , then is within of . Then, is at most from , and it can be matched with in a matching corresponding to a collision at  as well. Thus, we add to .

Case (neither nor are in  in ): By construction, in level , points  and  snap to the same lattice point  such that and , since  is snapped to the nearest lattice point and  is snapped to one of the four nearest lattice points.

Subcase (). We add to , since must be snap-rounded to in . In order to match in , we show that also has a snap-rounding to in level . Since , we know that either or . Since , the triangle inequality implies that . Therefore,  can be deleted in the snap-rounding in , so we add to .

Subcase (). Let be the lattice point in level to which snap-rounds. Since is more than from , , which implies that by the triangle inequality. Therefore, we can snap-round to  in . So, we add to .

Finally, we find as follows: for every such that is off-diagonal, we either (1) delete if is on , or (2) snap-round to the lattice point in  nearest to . Therefore, we conclude that if and collide at , then and also collide at . ∎

To find a near neighbor to in , we determine the last level such that has an existing key. However, first we must consider where the nearest neighbor lies relative to this level.

[Nearest Neighbor Bin] Let and be as defined in weakCollision. Let be obtained from by removing all points within of the diagonal. Let  be the nearest neighbor of in , with respect to the bottleneck distance between diagrams. Let be the largest index such that has a collision in . Then, there exists a snap-rounding such that .

###### Proof.

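Assume, for contradiction, that the nearest neighbor P* does not have a collision with Q in level i−2. By the contrapositive of Part 2 of the Collisions between Diagrams theorem, d_B(P*,Q) > ½δ_{i−2} = 2δ_i, since the lattice parameter halves from one level to the next. As some diagram P in Γ collides with Q in level i, Part 1 of the same theorem gives d_B(P,Q) ≤ (3/2)δ_i. Combining the inequalities, we get d_B(P,Q) < d_B(P*,Q), which implies that P* is not the nearest neighbor, a contradiction. ∎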

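A consequence of the Nearest Neighbor Bin lemma is that the nearest neighbor need not collide with Q at the finest colliding level: if i is the largest index such that Q has a collision in level i, then we can construct examples in which Q does not collide with its nearest neighbor in level i. Next, we show that any diagram colliding with Q in level i is an approximate nearest neighbor.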

[Nearest Neighbor Approximation] Let and be as defined in weakCollision. Let be the largest index such that . Let be the nearest neighbor of in terms of bottleneck distance. The bottleneck distance between and every diagram of with a key in is a six-approximation of the .

###### Proof.

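Let P be a diagram in Γ with a snap-rounding P′ whose key equals the key of Q at level i. In other words, P′ is a snap-rounding of P that collides with Q at level i. By the Nearest Neighbor Bin lemma, the nearest neighbor P* has a snap-rounding in level i−2 colliding with Q. And, by our assumption, as Q has no collisions in level i+1, P* may have its last collision with Q in level i−2, i−1, or i. To bound the bottleneck distance, we need only consider the worst-case scenario, where d_B(P*,Q) is as small as possible and d_B(P,Q) is as large as possible. As P and Q collide in level i, Part 1 of the Collisions between Diagrams theorem gives d_B(P,Q) ≤ (3/2)δ_i. And, by the contrapositive of Part 2 of that theorem, if the last collision of P* and Q is: in level i−2, then d_B(P*,Q) > ½δ_{i−1}; in level i−1, then d_B(P*,Q) > ½δ_i; or in level i, then d_B(P*,Q) > ½δ_{i+1}. Since the lattice parameter halves from one level to the next, we therefore have

$$\tfrac{1}{2}\delta_{i+1} < d_B(P^*,Q) \le d_B(P,Q) \le \tfrac{3}{2}\delta_i = 6\left(\tfrac{1}{2}\delta_{i+1}\right),$$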

which implies that every diagram with a key in colliding with is a six-approximation of the nearest neighbor of  in terms of bottleneck distance. ∎

The previous discussion tells us that we can find an approximate nearest neighbor by identifying the bin in the lowest level with a collision and picking any diagram in that bin. Moreover, it tells us that if we want to find the true nearest neighbor, we could linearly search for it through all diagrams with a collision two levels up. Next, we prove that we can query for an approximate kth nearest neighbor. First, we establish bounds on the location of the kth nearest neighbor, generalizing the results from the Nearest Neighbor Bin lemma.

[th-NN Location Upper Bound] Let and be as defined in weakCollision. Let  be the th nearest neighbor of in , with respect to the bottleneck distance between diagrams. Let be the largest index such that the number of distinct diagrams with snap-roundings and keys equal to  in is at least . Then, there exists a snap-rounding such that .

###### Proof.

Assume, for contradiction, that P_k does not have a collision with Q in level i−2. By the contrapositive of Part 2 of the Collisions between Diagrams theorem, d_B(P_k,Q) > ½δ_{i−2} = 2δ_i. Furthermore, by Part 1 of the same theorem, if a diagram P collides with Q in level i, then d_B(P,Q) ≤ (3/2)δ_i. Since there are at least k collisions with Q in level i, there must be k diagrams with bottleneck distance to Q less than or equal to (3/2)δ_i, which is strictly less than d_B(P_k,Q). The previous statement, however, is a contradiction to the claim that P_k is the kth nearest neighbor of Q with respect to bottleneck distance. ∎

Then, we also bound the number of levels with a finer grid resolution that the th-nearest neighbor can collide with a snap-rounding of the query diagram.

[th-NN Location Lower Bound] Let and be as defined in weakCollision. Let be the th nearest neighbor of in , with respect to the bottleneck distance between diagrams. Let be the largest index such that the number of distinct diagrams with snap-roundings and keys equal to  in is at least . Then, does not have a snap-rounding and key colliding with in any such that .

###### Proof.

If P_k has a snap-rounding that collides with Q in any level j ≥ i+3, then it will also have a snap-rounding colliding with Q in level i+3 by the Hierarchical Collision lemma. Therefore, it suffices to show that P_k does not collide with Q in level i+3. Suppose, for contradiction, that P_k has a snap-rounding that collides with Q in level i+3. Then, by Part 1 of the Collisions between Diagrams theorem, we know that d_B(P_k,Q) ≤ (3/2)δ_{i+3}, which implies that at least k diagrams in Γ have distance at most (3/2)δ_{i+3} from Q. Furthermore, since (3/2)δ_{i+3} ≤ ½δ_{i+1}, Part 2 of the same theorem implies that all of these diagrams collide with Q in level i+1. Hence, Q has at least k collisions in level i+1, which contradicts our choice of i. ∎

As a result of the previous two lemmas, we bound levels for which can collide with .

[th-NN Location] Let and be as defined in weakCollision. Let be the th nearest neighbor of , with respect to the bottleneck distance between diagrams. Let be the largest index such that the number of distinct diagrams with snap-roundings and keys equal to  at is at least . Let be the largest level in such that there is a snap rounding and key of colliding with . Then, , i.e., must have a snap-rounding in, at most, and in, at least, .

To find the kth nearest neighbor to Q in Γ, we determine the last level of the data structure such that Q has at least k matching keys. The proof is a slight modification of the proof of the Nearest Neighbor Approximation result.

[th-Nearest Neighbor Approximation] Let and be as defined in weakCollision. Let be a positive integer greater than one. Let be the th nearest neighbor of , with respect to the bottleneck distance between diagrams. Let be the largest index such that the number of distinct diagrams with snap-roundings and keys equal to  at is at least . The bottleneck distance between and every diagram of with a key in is a -approximation of the .

###### Proof.

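Let P be a diagram in Γ with a snap-rounding whose key equals the key of Q at level i. That is, P is one of the k distinct diagrams that collide with Q at level i. By the kth-NN Location result, P_k has its last collision with Q in level i−2, i−1, i, i+1, or i+2. To bound the bottleneck distance, we need only consider the worst-case scenario, in which d_B(P_k,Q) is as small as possible and d_B(P,Q) is as large as possible. As P and Q collide in level i, Part 1 of the Collisions between Diagrams theorem gives d_B(P,Q) ≤ (3/2)δ_i. And, by the contrapositive of Part 2 of that theorem, for −2 ≤ j ≤ 2, if the last collision of P_k and Q is in level i+j, then d_B(P_k,Q) > ½δ_{i+j+1}. The distance is smallest when j = 2, that is, d_B(P_k,Q) > ½δ_{i+3} = (1/16)δ_i. Therefore,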

$$d_B(P,Q) \le \tfrac{3}{2}\delta_i = 24\left(\tfrac{1}{16}\delta_i\right) < 24\, d_B(P_k,Q),$$

which implies that every diagram with a key in level i colliding with Q is a 24-approximation of the kth nearest neighbor of Q in terms of bottleneck distance. ∎

The approximation factor is controlled by the last level where and collide. So, while we do not propose an efficient test for identifying the last level, we observe that in some cases, the approximation factor is much tighter. For example, if last collides with  in , then for all that collide with in , .
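The dependence of the approximation factor on the last colliding level can be tabulated with a short calculation. The sketch below assumes, consistently with the identity (3/2)δ_i = 24 · (1/16)δ_i used in the proof, that the lattice parameter halves from one level to the next; the offset `j` (the last collision of the kth nearest neighbor occurs in level i + j) and the function name are introduced here for illustration only.

```python
from fractions import Fraction

def approximation_factor(j):
    """Upper bound on d_B(P, Q) / d_B(P_k, Q) when P_k last collides at level i + j.

    Distances are expressed in units of delta_i, assuming delta_{i+1} = delta_i / 2.
    """
    upper = Fraction(3, 2)                          # d_B(P, Q) <= (3/2) * delta_i
    lower = Fraction(1, 2) / Fraction(2) ** (j + 1) # d_B(P_k, Q) > (1/2) * delta_{i+j+1}
    return upper / lower

# Factors for the five possible offsets j = -2, ..., 2.
factors = {j: approximation_factor(j) for j in range(-2, 3)}
```

Under these assumptions the bound equals 3 · 2^(j+1): it is 24 in the worst case j = 2, it is 6 for j = 0 (matching the nearest neighbor bound), and it drops to 3/2 when the last collision is two levels coarser than level i.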

We now describe how to identify approximate nearest neighbors for a query diagram.

[Approximate Nearest Neighbor Query] Let and be as defined in weakCollision. Let and let be the multi-level structure described in bigDS with levels. Then, the data structure is of size and supports finding a six-approximation of the nearest neighbor of in in  time.

###### Proof.

We begin by describing how we can use the previous lemmas to organize and search . weakCollision tells us that for , then which implies that  can be ordered by . Using the ordering, we can perform a binary search to find the largest such that . Every diagram with a key in  colliding with is a six-approximation of the nearest neighbor of  by nn.

We first analyze the space of . The structure contains levels. For each level, we store, in increasing order, the snap-roundings for each of the persistence diagrams. So, the total space of is .

Next, we consider the complexity of finding a six-approximation of the nearest neighbor for . Searching for the largest such that requires a binary search through and another binary search through each  that is encountered to search for a collision.

The time for each search in  is analyzed using three observations. First, generating a key for , i.e., takes time. Second, comparing the keys of two diagrams takes time. Third, let ; since each diagram has hashes, each has at most keys. Using these three observations, we get that searching takes . The search time can be simplified. Since , the search at each  is time. Since we search  levels of and the search at each level is , the total query time is . ∎
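The search described in this proof can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each level is assumed to be a sorted list of keys (tuples of lattice indices), a higher level index is assumed to mean a finer lattice, and collisions are assumed to be monotone across levels, as the ordering derived from weakCollision suggests. The names `has_collision` and `last_colliding_level` are hypothetical.

```python
from bisect import bisect_left


def has_collision(sorted_keys, key):
    """Binary search for `key` in one level's sorted list of keys."""
    pos = bisect_left(sorted_keys, key)
    return pos < len(sorted_keys) and sorted_keys[pos] == key


def last_colliding_level(levels, query_keys):
    """Return the largest level index at which the query key collides, or -1.

    Assumption: if the query collides at level i, it also collides at
    every level j < i, so the levels can be binary searched.
    """
    lo, hi, best = 0, len(levels) - 1, -1
    while lo <= hi:
        mid = (lo + hi) // 2
        if has_collision(levels[mid], query_keys[mid]):
            best = mid      # collision here; try a finer level
            lo = mid + 1
        else:
            hi = mid - 1    # no collision; try a coarser level
    return best
```

The outer loop inspects a logarithmic number of levels, and each inspection is itself a binary search over that level's keys, which mirrors the two nested searches counted in the proof.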

We note that exponential search could replace binary search both for finding the last where collides with another key and for the search within each . If is the largest  such that the snap-rounding of collides with another key, and  is the index of the key in that collided with , then the query time becomes .
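For readers unfamiliar with exponential search, the following Python sketch shows the generic technique on a monotone predicate. It is an illustration only: `last_true_exponential` and `pred` are hypothetical names, and the predicate is assumed to be true on a prefix of the indices and false afterwards, which is the situation when collisions are ordered by level.

```python
def last_true_exponential(pred, n):
    """Largest i in [0, n) with pred(i) true, or -1 if pred(0) is false.

    Assumption: pred is true on a prefix of 0..n-1 and false afterwards.
    """
    if n == 0 or not pred(0):
        return -1
    # Doubling phase: find a bound just past the last true index.
    bound = 1
    while bound < n and pred(bound):
        bound *= 2
    # pred(bound // 2) is true, and the answer is at most min(bound, n) - 1.
    lo, hi = bound // 2, min(bound, n) - 1
    # Binary search phase inside [lo, hi].
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if pred(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

The number of predicate evaluations is logarithmic in the returned index rather than in `n`, which is why the query time can then be stated in terms of the position of the match instead of the full size of the structure.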

Finally, we prove that this data structure can answer queries requesting the k-nearest neighbors. Specifically, the k-nearest neighbors returned are each a 24-approximation of the kth nearest neighbor.

[-Nearest Neighbor Query] Let and be as defined in weakCollision. Let and let be the multi-level structure described in bigDS with levels. There exists a data structure of size that supports finding diagrams that are each, in the worst case, a -approximation of the th nearest neighbor of from in time.

###### Proof.

We begin by searching to find the largest such that: , and the list of diagrams at key has length at least . We can still use binary search to traverse the levels of . At each level, however, once a matching key is found, we must determine whether there are unique neighboring keys with the same value. Recall that at each key, we stored the count of the number of unique diagrams. Thus, we can determine the number of unique diagrams hashed to a particular key in constant time. Then, searching each level takes time from approxnn. Once the largest with colliding diagrams is found, we return any diagrams from the list at . Any of these colliding diagrams will be a -approximation of the th nearest neighbor of by knnApprox. Finding diagrams from the list at takes time, so the total time complexity for searching and returning diagrams at a particular level is , making the overall time complexity for searching . No modifications need to be made to  to support these queries, so the space complexity remains the same as in approxnn. ∎
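The query in this proof can be sketched as follows. This is a hedged illustration, not the authors' code: each level is assumed to be a pair of parallel lists, a sorted list of keys and, for each key, the list of distinct diagram identifiers hashed to it. A higher level index is assumed to mean a finer lattice, and the number of collisions is assumed to be non-increasing as the level index grows. The names `diagrams_at` and `approx_knn` are hypothetical.

```python
from bisect import bisect_left


def diagrams_at(level, key):
    """Return the distinct diagram ids stored under `key` in one level."""
    keys, buckets = level
    pos = bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        return buckets[pos]
    return []


def approx_knn(levels, query_keys, k):
    """Return k diagram ids from the finest level with at least k collisions."""
    lo, hi, best = 0, len(levels) - 1, -1
    while lo <= hi:
        mid = (lo + hi) // 2
        # len() is constant time, playing the role of the stored count.
        if len(diagrams_at(levels[mid], query_keys[mid])) >= k:
            best = mid
            lo = mid + 1
        else:
            hi = mid - 1
    if best < 0:
        return []
    return diagrams_at(levels[best], query_keys[best])[:k]
```

Returning the first `k` entries of the bucket reflects the statement that any of the colliding diagrams may be reported, at a cost linear in `k` on top of the level search.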

## 5 Concluding Remarks

In this paper, we address the problem of supporting approximate nearest neighbor search for a query persistence diagram among a finite set of (M,m)-bounded persistence diagrams. To the best of our knowledge, this result is the first method for searching a set of persistence diagrams with a query diagram that offers performance guarantees and does not require a linear number of bottleneck distance computations. We use ideas from locality-sensitive hashing along with a snap-rounding technique to generate keys for a searchable data structure with levels. Specifically, when , the search time for an (M,m)-bounded query diagram is and the search returns an approximate nearest persistence diagram within a factor of six. Additionally, searching for k approximate nearest neighbors can be done in , and each of the k diagrams is within a factor of twenty-four of the kth nearest neighbor. We note that, while our space complexity is exponential, our queries do not rely on probabilistic snap-roundings, and the reduction in size relative to similar approaches from [8] is significant. Specifically, for a data structure storing curves in , each with complexity at most , the constant factor approximation from [8] requires space, whereas our approach to storing diagrams with at most off-diagonal points requires  space.

For simplicity, we assumed that none of the points in the diagrams of are on grid lines. To handle points on grid lines, we add additional keys. More specifically, for a diagram  and a lattice in which a point and , we snap to its nine nearest neighbors in . If is on a grid line of (and not a grid point), we snap  to its six nearest neighbors. While the additional keys increase the size of , neither the space complexity of storing nor the time complexity of querying changes.
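One way to read the four, six, and nine neighbor cases is as a product of per-axis candidates, sketched below in Python. This is an interpretation rather than the paper's stated construction: it assumes a square lattice with spacing `delta` anchored at the origin, returns integer lattice indices rather than coordinates, and uses exact rational arithmetic so that "on a grid line" is tested without floating-point error. The names `axis_candidates` and `snap_candidates` are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import floor


def axis_candidates(coord, delta):
    """Lattice indices nearest to `coord` along one axis."""
    q = Fraction(coord) / Fraction(delta)
    base = floor(q)
    if q == base:
        # On a grid line: the line itself and both adjacent lines.
        return [base - 1, base, base + 1]
    # Strictly inside a cell: the two bounding grid lines.
    return [base, base + 1]


def snap_candidates(point, delta):
    """All lattice points (as index pairs) that `point` may snap to."""
    xs = axis_candidates(point[0], delta)
    ys = axis_candidates(point[1], delta)
    return list(product(xs, ys))
```

Under this reading, a point inside a cell yields 2 × 2 = 4 candidates, a point on a single grid line yields 3 × 2 = 6, and a point at a grid point yields 3 × 3 = 9, matching the counts in the paragraph above.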

While searching is logarithmic in the number of diagrams, the data structure becomes very large when the diagrams have even a moderate number of points. For example, with