Log In Sign Up

A fast approximate skeleton with guarantees for any cloud of points in a Euclidean space

by   Yury Elkin, et al.

The tree reconstruction problem is to find an embedded straight-line tree that approximates a given cloud of unorganized points in ℝ^m up to a certain error. A practical solution to this problem will accelerate a discovery of new colloidal products with desired physical properties such as viscosity. We define the Approximate Skeleton of any finite point cloud C in a Euclidean space with theoretical guarantees. The Approximate Skeleton ASk(C) always belongs to a given offset of C, i.e. the maximum distance from C to ASk(C) can be a given maximum error. The number of vertices in the Approximate Skeleton is close to the minimum number in an optimal tree by factor 2. The new Approximate Skeleton of any unorganized point cloud C is computed in a near linear time in the number of points in C. Finally, the Approximate Skeleton outperforms past skeletonization algorithms on the size and accuracy of reconstruction for a large dataset of real micelles and random clouds.


page 1

page 2

page 3

page 4


A novel tree-structured point cloud dataset for skeletonization algorithm evaluation

Curve skeleton extraction from unorganized point cloud is a fundamental ...

Skeletonisation Algorithms for Unorganised Point Clouds with Theoretical Guarantees

Real datasets often come in the form of an unorganised cloud of points. ...

Skeletonisation Algorithms with Theoretical Guarantees for Unorganised Point Clouds with High Levels of Noise

Data Science aims to extract meaningful knowledge from unorganised data....

Linear Algorithm for Digital Euclidean Connected Skeleton

The skeleton is an essential shape characteristic providing a compact re...

Approximate Data Depth Revisited

Halfspace depth and β-skeleton depth are two types of depth functions in...

Optimal Skeleton Huffman Trees Revisited

A skeleton Huffman tree is a Huffman tree in which all disjoint maximal ...

Reconstructing a convex polygon from its ω-cloud

An ω-wedge is the (closed) set of all points contained between two rays ...

1 Introduction: reconstructions from unorganized clouds

Potential molecules for new colloidal products are tested by simulations that produce unorganized finite clouds of points (one point per molecule in Fig. 1). Molecules tend to form clusters (called micelles) whose shapes (degrees of branching, edge-lengths) affect physical properties of colloidal products, e.g. their viscosity.

These 3D micelles can have complicated branched shapes as in Fig. 7 and are visually analyzed by human experts who struggle to make reliable measurements quickly. To substantially speed-up the discovery of new molecules, we propose a new Approximate Skeleton to solve the following problem.

The tree reconstruction problem. Given a point cloud and an error , design a fast algorithm to build a straight-line tree (see Definition 1) that has a minimum number of vertices and whose -offset (neighborhood) covers .

The first (combinatorial) guarantee is for the number of vertices in , which is close to the minimum number in an optimal tree for a given approximation error by factor 2, see Theorem 8. The second (geometric) guarantee about a near linear time for building is the number of points in , see Corollary 9.

Figure 1: Left: point clouds from real micelles. Right: Approximate Skeletons .

To automatically characterize branching shapes of micelles (clusters of molecules in colloids), an Approximate Skeleton allows us to compute

the topological type of any unorganized cloud , e.g. count all non-trivial vertices of whose degree is 1 (endpoints) or more than 2 (branching);

the geometric characteristics of , e.g. edge-lengths of ;

the error of approximating a cloud by its skeleton , see Table 10.

Here is the pipeline of the Approximate Skeleton .

Stage 1 in section 3: for a cloud , we build an initial tree , which has a small number of branching vertices within a Minimum Spanning Tree of .

Stage 2 in section 4: replace polygonal paths of by approximate paths with much fewer vertices to get in a near linear time within a given error.

Figure 2: Pipeline to compute an Approximate Skeleton : is classical, the new subtree is introduced in Definition 6 in section 3, final is built in section 4


The key novelty and contributions to the data skeletonization are the following.

Theorem 8 guarantees a small number of vertices in the Approximate Skeleton close to the minimum by factor 2 in an optimal tree within a given error.

Corollary 9 guarantees a near linear time to compute within an error.

2 Basic definitions and a review of the related past work

Definition 1 (a straight-line graph, -approximation)

A straight-line graph (briefly, a graph) consists of vertices at points and undirected straight-line edges connecting pairs , , in such a way that any edges meet only at their common vertex. Let be the Euclidean distance. For , a cloud is -approximated by a graph if is within the -offset that is the union of -balls at all points of , i.e. .

Past algorithms without guarantees. Singh et al. Singh et al. (2000) approximated a cloud by a subgraph of a Delaunay triangulation, which requires time for points of and the three thresholds: a minimum number of edges in a cycle and for inserting/merging 2nd order Voronoi regions. Similar parameters are need for principal curves Kégl and Krzyzak (2002), which were later extended to iteratively computed elastic maps Gorban and Zinovyev (2009)

. Since it is often hard to estimate a rate of convergence for iterative algorithms, we discuss below non-iterative methods with theoretical guarantees.

The metric graph reconstruction (MGR) takes as an input a large metric graph , which is an abstract graph with weighted edges and outputs a smaller abstract metric graph . The distance between any points of a metric graph is defined as the length of a shortest path these points. If is a good -approximation to an unknown graph , then M. Aanjaneya et al. (Aanjaneya et al., 2012, Theorem 5) proved the existence of a homeomorphism that distorts the metrics on and with a multiplicative factor for , where is the length of a shortest edge of .

The authors of the Reeb graph skeletonization (Ge et al., 2011, page 3) have checked that for the MGR algorithm from Aanjaneya et al. (2012) “it is often hard to find suitable parameters in practice, and such local decisions tend to be less reliable when the input data are not as nice (such as a ‘fat’ junction region)”, see this junction in the 2nd picture of Fig. 1.

Definition 2 (a Reeb graph)

Given a topological space (or ) with a function , the Reeb graph is obtained from by collapsing each connected components of every level set of to a single point, so the Reeb graph is the quotient of by the equivalence relation if and only if and the points are in the same connected component of .

Skeletonization via Reeb-type graphs. The Vietoris-Rips complex on a cloud consists of all simplices spanned by points whose pairwise distances are at most . Starting from a noisy sample of an unknown graph with a scale parameter, X. Ge et al. (Ge et al., 2011, Theorem 3.1) proved that the Reeb graph of has a correct homotopy type if there is a triangulated space with a continuous deformation that -approximates the metrics of . The homotopy type of a graph is the equivalence class of graphs under deformations when any edge (with distinct endpoints) can be collapsed to a point.

The graph reconstruction by discrete Morse theory (DMT). T. Dey et al. Dey et al. (2018) substantially improved the discrete Morse-based framework Delgado-Friedrichs et al. (2015) and proved new homotopy guarantees when an input is a density function , which ‘concentrates’ around a hidden geometric graph

. The key advantage of this approach is the unbounded noise model that allows outliers far away from the underlying graph

, which has found practical applications to map reconstructions Wang et al. (2015); Dey et al. (2017).

Since the molecules of a micelle form an unorganized cloud of points (with large bounded noise) around hidden tree structures, the Tree Reconstruction problem in section 1 essentially differs from the above approaches. An initial unorganized cloud of points is not an abstract metric graph (as in the metric graph reconstruction problem) and not a simplicial complex with scalar values at vertices (as in the discrete Morse theory approach), so extra pre-processing was needed in section 5.

The -Reeb graph by F. Chazal et al. Chazal et al. (2015) solves the metric graph reconstruction problem, where the input is not an unorganized cloud, but a large metric graph that should be approximated by a smaller graph . For a base point , the image of the distance function is covered by intervals having a length and 50% overlap. Every connected component of defines a node in the -Reeb graph . Two nodes are linked if the corresponding components overlap.

Informally, controls the size of a subset of that maps to a single vertex of . Theorem 4.9 in Chazal et al. (2015) says that if is -close to an unknown graph with edges of minimum length , the output is -close to in the Gromov-Hausdorff distance between spaces, not within one space, where is the first Betti number of . The algorithm has the fast time for points in . Similarly to Reeb graphs, -Reeb graphs are abstract without an intrinsic embedding into the space of the cloud and can have self-intersections even for .

The Mapper Singh et al. (2007) extends any clustering algorithm and outputs a network of interlinked clusters and needs a user-defined function , which helps to link different clusters of a cloud . Another parameter is a covering of the image of by a given number of intervals (often with 50% overlap). Each of subclouds is clustered. Every cluster defines a node in the Mapper graph. Two nodes are linked if the corresponding clusters overlap. M. Carriére et al. Carriere and Oudot (2017) have proved first theoretical guarantees for the Mapper output.

More recent persistence-based algorithms for graph reconstruction Kurlin,V. (2015a, b); Kalisnik et al. (2019) and image segmentation Forsythe et al. (2016); Forsythe and Kurlin (2017); Kurlin and Harvey (2018); Kurlin and Muszynski (2020) essentially find most persistent cycles hidden in a cloud, hence go beyond the tree reconstruction problem in section 1.

Straightening polygonal curves

is a key ingredient in many skeletonization algorithms. Douglas-Peucker’s heuristic

Douglas and Peucker (1973) approximates a long zigzag line by a simpler line with fewer vertices, see section 4. The elegant algorithm by P. Agarwal et al. Agarwal et al. (2005) guarantees a near linear time and a small number of vertices in a final polygonal approximation when used with the Frechet distance between curves in . For the Hausdorff distance and higher dimensions, there is no near linear time straightening with guarantees on the size of a skeleton to our best knowledge.

Definition 3 ()

For a cloud , a Minimum Spanning Tree is a connected graph that has (1) the vertex set , (2) no cycles, and (3) a minimum total length, where lengths of edges are measured in the Euclidean distance.

If all distances between points of are distinct, then is unique. We write a Minimum Spanning Tree, similarly an Approximate Skeleton, to cover all cases.

Theorem 4

(March et al., 2010, Theorem 5.1) For any cloud of points, a Minimum Spanning Tree can be computed in time , where is the inverse Ackermann function; are defined in Appendix A.

3 A new tree defined for any point cloud

This section introduces an important subtree , which has many fewer non-trivial vertices than a usually ‘hairy’ from Definition 3.

A tree might still have too many zigzags and will be replaced by a better tree with fewer vertices in section 4. A vertex of a degree is called (topologically) non-trivial, because any vertex of degree 2 can be potentially removed by straightening algorithms in section 4. Since contains many non-trivial vertices, the next hard step is to identify those few vertices of that represent ‘true’ vertices of a tree , which we try to reconstruct from .

Figure 3: Left. One vertex (large red dot at the bottom) of has a high depth by Definition 5 and is connected by longest paths to 3 vertices of degree 1. The other vertices have at most 2 disjoint long paths within . Right. The red monotone paths of and subclouds from Algorithm 2 are shown disjointly.

Definition 5 introduces the depth characterizing how deep a vertex sits within . At a deep vertex of a degree at least 3 sufficiently long paths (without common edges) should meet, see the 3 red long paths in Fig. 3. The previous procedural approach by M. Aanjaneya et al. (Aanjaneya et al., 2012, Fig. 1b) to detect branching points in a shape of used more parameters than a single branching factor below.

Definition 5 (deep vertices)

For a cloud and a vertex of a degree , let be the branches (subtrees) joined at the vertex . Let be the length of a longest path within the branch from to another vertex, . Assuming that , set . Let be the average edge-length of . For a branching factor , the vertices of whose depths are larger than are called deep.

Taking the minimum above guarantees that vertices in any short branches of are not deep, hence deep vertices can not form small cliques. The experiments on real micelles in section 5 justify that Definition 5 separates deep vertices from other shallow vertices for a long range of the factor .

Figure 4: Left: black for the two clouds in Fig. 1. Right: red in Definition 6.

Definition 6 introduces a subtree , which non-essentially depends on the branching factor and better approximates a cloud than , see Fig. 3.

Definition 6 ()

In Definition 5, if we remove all deep vertices , splits into several subtrees. If the closure of such a subtree has two deep vertices , they are joined by a unique path . If has one deep vertex , take a longest path from to another vertex . We ignore if its length is less than , where is the branching factor from Definition 5. All the vertices and the paths between them form the subtree .

If the closure of a subtree above has deep vertices , then contains a vertex with at least 3 paths to . Then , , so is also deep and should be split by removing . Hence .

In Fig. 3 the two black edges at the red deep vertex of degree 5 are too short, hence ignored in Definition 6. The tree consists of only 3 red long paths meeting at . Here are the steps of Stage 1 for the Approximate Skeleton .

Step 1a. If needed, split a cloud in clusters to approximate them below.

Step 1b. If is one cluster, find by the fast algorithm from Theorem 4.

Step 1c. Find the depths of vertices in by Algorithm 1 in Appendix A.

Step 1d. Identify all deep vertices of by their .

Step 1e. The subtree is formed by all the paths and from Definition 6 that have lengths more than , where is a given branching factor.

4 : Approximate Skeleton of a cloud

The tree from Definition 6 has only few non-trivial vertices, but contains noisy zigzags with too many trivial vertices of degree 2. This section discusses how to straighten these zigzags and decrease the total number of vertices.

We have tried Douglas-Peucker’s heuristic Douglas and Peucker (1973), which was rather unstable and produced large zigzags on curved micelles in Fig. 1. The worst complexity is in the number of points for . A final approximation can have a size even in . Another problem with Douglas and Peucker (1973) are potential self-intersections even in , which are caused by large zigzags that approximate non-monotone curves Wu et al. (2004).

The problem of straightening polygonal paths in a tree is harder than the curve simplification, because the input is a cloud of unorganized points. So a final approximation should take into account the points of a cloud outside .

Definition 7

Let be a straight line. An ordered cloud is called monotone with respect to if the order of points is preserved by the orthogonal projection of to .

Since there are many paths of to straighten, we split the cloud into monotone subclouds as formalized in Algorithm 2 in Appendix A. Since monotone subpaths can be quickly found only in Shin and Kim (2004), Theorem 8 below will assume that each subpath between non-trivial vertices of is monotone by Definition 7 with respect to the straight line connecting the endpoints of .

All results in this section are proved in Appendix A. Here are the Stage 2 steps.

Step 2a. Split every polygonal path between non-trivial vertices (of degrees ) in the subtree into monotone subpaths by Algorithm 2.

Step 2b. Each monotone subpath of with endpoints (say) has the subcloud approximated by a polygonal path via points of by Steps 2c–2f.

Step 2c. For each subcloud of points ordered by their orthogonal projections to , start from and find the next index for by repeating Steps 2d–2e, which is possible by Lemma 12 in Appendix A.

Step 2d (exponential). Find the smallest index such that for , For every index , compute the distance orthogonally to the line segment as in Definition 11.

Step 2e (binary). Search for the maximum between and such that by dividing the range in 2 halves.

Step 2f. The found indices specify a polygonal path -approximating each monotone subcloud from Step 2b. Combine all these paths into a full skeleton.

Step 2g. Any edges of a length more than from Definition 5 are temporarily removed from the skeleton. Each remaining connected component with only short edges is collapsed to its center of mass. The resulting vertices are connected according to the temporarily removed edges to get the Approximate Skeleton .

For a cloud , mark the endpoints of all monotone subpaths in obtained by Algorithm 2. Consider all skeletons that have fixed vertices at the marked points of such that any polygonal path between fixed vertices (say ) is monotone under the orthogonal projection to the line segment .

The approximation problem for an error is to minimize the total number of vertices in a straight-line graph whose each monotone path should -approximate the corresponding subcloud of by the distance in Definition 11.

Theorem 8

Let be the minimum number of vertices over all graphs -approximating a given cloud . Then lies in and has at most vertices.

Theorem 8 estimates the number of vertices of when the geometric error is . In practice, the tree has an initial approximation error for a given cloud , because many points of may not be vertices of .

We measure the initial error by Definition 11 and take the maximum of over monotone paths of computed in Algorithm 2. Stage 2 approximates by a graph simpler than , but keeps the approximation error small. The error in Corollary 9 is , where is an error factor that takes values in the interval for the experiments in section 5.

Corollary 9

For any points and any error factor , an Approximate Skeleton within the -offset of the cloud (as in Definition 1) can be computed in time , where is the inverse Ackermann function, the constants are defined in Appendix A.

5 Comparisons of 5 algorithms on real and synthetic data

This section experimentally compares the Approximate Skeleton with those four skeletonization algorithms from section 2 that have theoretical guarantees and accept any cloud of points: Mapper Singh et al. (2007), Metric Graph Reconstruction MGR Aanjaneya et al. (2012), -Reeb graphs Chazal et al. (2015) and most recent discrete Morse theory (DMT) algorithm Dey et al. (2018).

The Mapper Singh et al. (2007) is very flexible in the sense that its parameters might be manually tuned for given data over numerous clustering algorithms. Having tried several possibilities, we have settled on the following choices from the original work Singh et al. (2007).

1) Convert a cloud into a connected neighborhood graph with Euclidean edge-lengths by using a distance threshold. The filter function is the distance function in from a root that is the furthest point from a random point in .

2) The image of the filter function is covered by 10 intervals with the 50% overlap so that splits into 10 subclouds when filter values are in one of the 10 intervals.

3) Each subcloud is clustered by the single-linkage clustering with the threshold the average edge-length of , where values of the factor are given in Table 10. The final Mapper graph has a single node representing each cluster.

The authors of the DMT algorithm by T. Dey et al. Dey et al. (2018) have kindly made their code available at Starting from an unorganized cloud of points, e.g. centers of molecules of a micelle, we generated scalar values at nodes of a regular grid required for the DMT algorithm.

1) We subdivide the axis-aligned bounding box of a cloud into small boxes: minimum 20 rectangular boxes (as close to cubic as possible) along each side.

2) The scalar values are found from the Kernel Density Estimate

at every grid node . The computed values are passed to the DMT with a parameter that regulates how small density values are replaced by 0.

The MGR algorithm has required much more efforts, because the original code was lost as confirmed by the main author of Aanjaneya et al. (2012). Since the algorithm was well-explained, we have implemented MGR ourselves and confirmed the earlier claim that “it is often hard to find suitable parameters” (Ge et al., 2011, page 3). Trying many values of the key parameter gave the zero success rate on the homeomorphism type.

Hence we have improved MGR by splitting this parameter into two: the first (values used in all experiments) was used for detecting vertex points, the second (three values in Table 10 experiments) was used for clustering points of different types. Only after using these different values, we have managed to push the success rates of MGR closer to on the homeomorphism type.

The -Reeb graph has the essential parameter whose values were tried in all experiments. has little dependence on the branching factor , e.g. all values produced almost identical results in Table 10 and Fig. 9.

Since three output graphs (Mapper, -Reeb and MGR) are abstract, to compute any geometric error of approximation, we map them to by sending every node of to the average point (center of mass) of the cluster (for Mapper and MGR) or subgraph (for -Reeb) corresponding to . Each link between nodes is mapped as a line segment between the corresponding points in .

Figures 58 show clouds and outputs of 5 algorithms. Since real micelles have irregular shapes in , their 2D projections may contain intersections of edges.

Table 10 shows the average results of the three algorithms on the dataset of more than 100 real micelles (clouds of about 300 molecules) whose endpoints and homeomorphism types were manually detected. A homeomorphism is a 1-1 continuous map with a continuous inverse, so a homeomorphism type is a stronger shape descriptor than a homotopy type, which counts only linearly independent cycles.

Figure 5: 1st: a cylindrical micelle with no branching vertices. 2nd: Mapper, 3rd: -Reeb, 4th: MGR, 5th: DMT, 6th: new .
Figure 6: 1st: a branched micelle with exactly one degree 3 vertex, 2nd: Mapper, 3rd: -Reeb, 4th: MGR, 5th: DMT, 6th: new .
Figure 7: 1st: ‘Christmas tree’ micelle with several degree 3 vertices). 2nd: Mapper, 3rd: -Reeb, 4th: MGR, 5th: DMT, 6th: new . All intersections come only from planar projections.
Figure 8: 1st: a random point sample around an 8-star in , 2nd: Mapper, 3rd: -Reeb, 4th: MGR, 5th: DMT, 6th: new .

The most important error measure for the tree reconstruction problem in section 1 is the success rate for detecting a correct homeomorphism type. Indeed, an incorrect graph can be perfect on other errors, e.g. is extremely fast, has the zero geometric error (for many distances between a cloud and a reconstructed graph) and even has a correct homotopy type (no cycles) for any underlying tree .

Hence the key results are in the middle column of Table 10 and the top right picture of Fig. 9. We included the success rate on the number of endpoints (degree 1 vertices) as a weaker topological error. As shows above, only if an algorithm performs well on a topological reconstruction, it makes sense to evaluate the performance on other measures such as geometric distances and time.

algorithm parameters endpoints success homeomorphism success time, ms max distance from
Mapper 54.21% 54.21% 18 4.59
Mapper 66.36% 66.36% 18 4.62
Mapper 68.20% 68.20% 19 4.67
MGR 48.60% 45.79% 25010 5.95
MGR 40.19% 40.19% 17410 5.51
MGR 29.91% 29.91% 25480 3.46
-Reeb 98.13% 98.13% 367 10.49
-Reeb 97.20% 97.20% 375 12.50
-Reeb 98.13% 98.13% 373 14.19
DMT 48.60% 45.79% 6290 5.95
DMT 40.19% 40.19% 6192 5.51
DMT 29.91% 29.91% 6410 3.46
98.13% 98.13% 42 5.16
98.13% 98.13% 42 5.16
97.20% 97.20% 42 5.31
Table 10: Columns 3-4 contain success rates for detecting the correct number of endpoints and a homeomorphism type of a graph over more than 100 real micelles. Column 5 contains the maximum Euclidean distance from points of a given cloud to the reconstructed graph .

Table 10 shows that the Mapper, MGR and DMT essentially depend on their parameters, because the success rates, run time and distance error significantly vary when the parameters are only slightly changed. The -Reeb and were stable, because Table 10 contains almost identical success rates for different parameters.

Both algorithms achieved best results on the most important measure of the homeomorphism success rate, though the top right picture in Fig. 9 highlights and MGR as the best for homeomorphism. In comparison with -Reeb and MGR, the Approximate Skeleton is much faster and achieves similar distance errors, see the relevant results in both Table 10 and Fig. 9.

Figure 9: For , each dot represents the average over 100 noisy samples around a random -star graph in and 3 parameters as in Table 10. Mapper: green and long-dashed, MGR: orange and sparsely dotted, -Reeb: red and densely dotted, DMT: olive and short-dashed, : blue and solid. Top left: the success rate in percentages for detecting a correct number of endpoints. Top right: the success rate in percentages for detecting the homeomorphism type. Bottom left (logarithmic scale): average run times in milliseconds. Bottom right: the max distance from a cloud to reconstructed graphs. The exact numbers are in the txt files in the supplementary materials.

In addition to the comparison on more than 100 micelles, we have tested the algorithms on the much larger dataset of synthetic clouds generated as follows.

1) An -star in has one vertex at and straight edges of length 100 to endpointsin random directions with a minimum angle between edges.

2) For and every of 100 random -stars , we found a minimum axis-aligned box containing , enlarged this box by the noise bound of 10%.

3) We uniformly chose a random point in the resulting box and checked if is at a distance at most 10 (=10% of edge-lengths) from . If successful, such points form a noisy sample of the ground truth -star .

Fig. 9 shows 4 plots for the 4 error measures of 5 algorithms, which were averaged over 3 values of essential parameters as in Table 10. The Mapper threshold factor for single-edge clustering was . The -Reeb scale was . The branching factor of was .

For the correct number of endpoints, the new skeleton achieves 100% results on the synthetic clouds, because Definition 5 provides a very stable concept of a deep vertex not critically depending on a branching factor . For the homeomorphism type, the minimum success of is 96%, because all short branches of are removed to get homeomorphic to an underlying tree.

For the random point sample of the 8-star graph in Fig. 8, the 2nd, 3rd and 5th graphs have several branched vertices instead of one. The 5th graph has several zigzags, which would be straightened in . The 4th graph has a triangular cycle because of incorrectly detected overlaps of clusters corresponding to vertices.

6 Conclusions and a discussion of the Approximate Skeleton

Though the current implementation was tested in , all steps and results work in any . Here is the summary of the key contributions to data skeletonization.

The detection of deep (branched) vertices in Definition 5 uses a global structure of longest paths within , hence is more stable under a change of parameters.

To improve the Metric Graph Reconstruction by M. Aanjaneya et al. Aanjaneya et al. (2012), we have split one parameter (used for detecting vertex points and also for clustering later) into two separate parameters (with default values) , , which led to more successful (20-40% rates instead of 0%) reconstructions in Table 10.

Theorem 8 proves the first size guarantees (on a small number of vertices) for the Approximate Skeleton , while all past methods from section 2 considered topological (mostly homotopy type) or metric properties of reconstructed graphs.

Corollary 9 says that the Approximate Skeleton can be quickly computed within a given error as required in the Tree Reconstruction Problem from section 1.

Because of the page limit the last author couldn’t include one more result on with realistic conditions on an underlying tree and its noisy sample to guarantee that and are homeomorphic to . This is the first advance after Giesen’s guarantees for shortest paths through sample points Giesen (1999) in 1999. The C++ code of is at

In comparison with the past methods in section 2, starts from the most challenging input (an unorganized cloud of points without any extra structure such a metric graph or a regular grid or a mesh), outputs an embedded graph in and provides two guarantees: combinatorial in Theorem 8 and geometric in Corollary 9 and topological. Appendix A has all missed proofs. Appendix B includes more experiments. We thank all reviewers for their helpful suggestions.


  • M. Aanjaneya, F. Chazal, D. Chen, M. Glisse, L. Guibas, and D. Morozov (2012) Metric graph reconstruction from noisy data. Int. J. Comp. Geometry Appl. 22, pp. 305–325. Cited by: §2, §2, §3, §5, §5, §6.
  • P. Agarwal, S. Har-Peled, N. Mustafa, and Y. Wang (2005) Near-linear time approximation algorithms for curve simplification. Algorithmica 42 (3-4), pp. 203–219. Cited by: §2.
  • A. Beygelzimer, S. Kakade, and J. Langford (2006) Cover trees for nearest neighbor. In Proceedings of ICML, pp. 97–104. Cited by: §7, Definition 14.
  • M. Carriere and S. Oudot (2017) Structure and stability of the one-dimensional mapper. Foundations of Computational Mathematics, pp. 1–64. Cited by: §2.
  • F. Chazal, R. Huang, and J. Sun (2015) Gromov-hausdorff approximation of filament structure using reeb-type graph. Discrete Computational Geometry 53, pp. 621–649. Cited by: §2, §2, §5.
  • O. Delgado-Friedrichs, V. Robins, and A. Sheppard (2015) Skeletonization and partitioning of digital images using discrete morse theory. Trans. PAMI 37(3), pp. 654–666. Cited by: §2.
  • T. K. Dey, J. Wang, and Y. Wang (2017) Improved road network reconstruction using discrete morse theory. In Proceedings of the 25th ACM SIGSPATIAL, pp. 58. Cited by: §2.
  • T. K. Dey, J. Wang, and Y. Wang (2018) Graph reconstruction by discrete morse theory. arXiv:1803.05093. Cited by: §2, §5, §5.
  • D. Douglas and T. Peucker (1973) Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica 10 (2), pp. 112–122. Cited by: §2, §4.
  • J. Forsythe, V. Kurlin, and A. Fitzgibbon (2016) Resolution-independent superpixels based on convex constrained meshes without small angles. In LNCS, Proceedings of ISVC 2016, Vol. 10072, pp. 223–233. Cited by: §2.
  • J. Forsythe and V. Kurlin (2017) Convex constrained meshes for superpixel segmentations of images. Journal of Electronic Imaging 26(6) (061609). Cited by: §2.
  • M. Fredman and R. Tarjan (1987) Fibonacci heaps and their uses in improved network optimization algorithms. Journal of the ACM 34 (3), pp. 596–615. Cited by: §7.
  • X. Ge, I. Safa, M. Belkin, and Y. Wang (2011) Data skeletonization via reeb graphs. In Proceedings of Neural Information Processing Systems, pp. 837–845. Cited by: §2, §2, §5.
  • J. Giesen (1999) Curve reconstruction, the traveling salesman problem and menger’s theorem on length. In Proceedings of SoCG, pp. 207–216. Cited by: §6.
  • A. Gorban and A. Zinovyev (2009) Principal graphs and manifolds. In

    Handbook of Research on Machine Learning Applications and Trends

    pp. 28–59. Cited by: §2.
  • S. Kalisnik, Kurlin,V., and Lesnik,D. (2019) A higher-dimensional homologically persistent skeleton. Advances in Applied Mathematics 102, pp. 113–142. Cited by: §2.
  • B. Kégl and A. Krzyzak (2002) Piecewise linear skeletonization using principal curves. Trans. Pattern Analysis Machine Intelligence 24, pp. 59–74. Cited by: §2.
  • V. Kurlin and D. Harvey (2018) Superpixels optimized by color and shape. 10746, pp. 297–311. Cited by: §2.
  • V. Kurlin and G. Muszynski (2020) Persistence-based resolution-independent meshes of superpixels. Pattern Recognition Letters, pp. 300–306. Cited by: §2.
  • Kurlin,V. (2015a) A homologically persistent skeleton is a fast and robust descriptor of interest points in 2d images. In Proceedings of CAIP, Vol. 9256, pp. 606–617. Cited by: §2.
  • Kurlin,V. (2015b) A one-dimensional homologically persistent skeleton of an unstructured point cloud in any metric space. Computer Graphics Forum 34 (5), pp. 253–262. Cited by: §2.
  • W. March, P. Ram, and A. Gray (2010) Fast euclidean minimum spanning tree: algorithm, analysis, and applications. In Proceedings of SIGKDD, pp. 603–612. Cited by: Theorem 4.
  • H. Shin and D.-S. Kim (2004) Optimal direction for monotone chain decomposition. In International Conference on Computational Science and Its Applications, pp. 583–591. Cited by: §4.
  • G. Singh, F. Memoli, and G. Carlsson (2007)

    Topological methods for the analysis of high dimensional data and 3d object recognition

    In Symp. Point-Based Graphics, pp. 91–100. Cited by: §2, §5, §5.
  • R. Singh, V. Cherkassky, and N. Papanikolopoulos (2000) Self-organizing maps for the skeletonization of sparse shapes.

    IEEE Tran. Neural Networks

    11, pp. 241–248.
    Cited by: §2.
  • S. Wang, Y. Wang, and Y. Li (2015) Efficient map reconstruction and augmentation via topological methods. In Proceedings of the 23rd SIGSPATIAL, pp. 25. Cited by: §2.
  • S.-T. Wu, A. da Silva, and M. Márquez (2004) The Douglas-Peucker algorithm: sufficiency conditions for non-self-intersections. J. Brazilian Computer Society 9 (3), pp. 67–84. Cited by: §4.

7 Appendix A: proofs of all the statements from section 4

The proof of Corollary 9 below uses Algorithm 1 for depths of vertices in from Definition 5. The depth is trivial (equal to 0) for any degree 1 vertex. For any other vertex , the depth can be recursively computed from lengths of edges at and depths of neighbors of . Imagine a water flow simultaneously starting from all degree 1 vertices of and moving towards internal vertices inside . At every vertex of degree , the flow waits until is reached from directions (edges at ), then the flow moves further in the remaining -th direction.

  Input: the initial tree
  Initialize Minimal Binary Heap of (vertex,depth)
  For all deg 1 vertices , add to ;
  while H is not empty do
      = H.pop(); // take the vertex of a min depth
     set = the only neighbor of in ;
     Remove the edge from , but keep ;
     Add to front of the list Neighbors;
     if  in  then
        Set ;
     else if  then
        Add the pair to the heap
     end if
  end while

  Initialize a vector[] depths; // a future output

  for  all vertices  do
     if  then
        Set depths[v] = 3rd element of Neighbors;
        Set depths[v] = 2nd element of Neighbors;
     end if
  end for
Algorithm 1 Computing depths of vertices from Definition 5 in Step 1c by ‘simultaneous flows’ moving from endpoints.

Definition 11 introduces a new distance between a cloud and a polygonal line. Recall that is the Euclidean distance in . We assume that the points are ordered by their orthogonal projections to the line .

Definition 11

For , let be the hyperspace that is orthogonal to and passes through . The distance between and is measured orthogonally to as , see Fig. 10. Consider the distance . For , the distance between and the polygonal line is defined as .

Figure 10: The distance from to in Definition 11 is measured orthogonally to .

Lemma 12 below justifies the steps of Stage 2 in section 4, which outputs the Approximate Skeleton starting from obtained in Stage 1 at the end of section 3.

Lemma 12

Let be points ordered according to their orthogonal projections to . For , one can find indices in time so that the estimates below for the distances in Definition 11 hold:

(a) for ,

(b) for any .

  Input: unordered set of points and line
  Output: Sequence of vertices that form monotone paths and containing ordering of points .
  Define to be ordering of obtained from projecting to orthogonaly.
  Define to be the map from points to their indices in .
  Define to be queue and add first point of to .
  Define to be the first point of .
  while Point is not the last point of  do
     Let point be the next point from in ordering
     while Point is not the last point of and  do
        Set to be .
        Set to be the next point from in ordering .
     end while
     Add to .
     if  is the last point in ordering  then
        Exit the program.
     end if
     while Point is not the last point of and  do
        Set to be .
        Set to be the next point from in ordering .
     end while
     Add to .
  end while
Algorithm 2 monotone subclouds of . For each point in a given cloud , find approximately its closest edge of . Then any edge has the edge-cloud of points that are closer to than to other edges of . For every polygonal path between non-trivial vertices define cloud and straight line spanned by points and . Run the algorithm above with parameters .

The following lemma is needed for Theorem 8 and is conveniently illustrated in Fig. 10 below. Recall that the distances from Definition 11 are computed orthogonally to the straight segment passing through the endpoints of a monotone point cloud .

Lemma 13

Let be points ordered according to their orthogonal projections to . Then for any indices .

Proof. For any , let be the hyperspace that is orthogonal to and passes through . Consider the intersection points and . Let , then . Since the points and are -close to the segment , the intermediate point is also -close to , i.e. . The triangle inequality implies that . Taking the maximum over , we get . ∎

Proof of Lemma 12. Assuming that indices were found, we search for the next index as follows. Search exponentially by trying indices for while .

Each evaluation of the distance requires time, because we need to compare distances to (orthogonally to ) from every point of between and .

After finding and such that and , we start a binary search for in the range each time choosing one half of the current range until both conditions (a)-(b) hold.

Finding the next index requires computations for the distance , where , hence time overall. Taking the sum over all , the total time is . ∎

Proof of Theorem 8. Since endpoints of all monotone polygonal paths of are fixed in minimization problem before Theorem 8, we separately consider every corresponding monotone subcloud of points (say) ordered by their orthogonal projections to the line through the line segment . Let be indices of an optimal -approximation (polygonal path) to . In the notations of Lemma 12 for the approximation error we will prove below that by induction on . Then and the size of the list , which is found in Lemma 12, is at most as required. Taking the sum of upper bounds over all monotone paths of , we conclude that the -Approximate Skeleton has the total number of vertices not greater than that that number for an -optimal skeleton .

The base means that , i.e. both paths start from the point . In the inductive step assume that . If , then and the inductive step is complete. The remaining case is . Since is an -approximation to , we have . Lemma 13 implies that for any index such that . Lemma 12(b) for the approximation says that , hence . ∎

Definition 14 (expanion constants)

Let be a cloud and be the closed ball with the center and radius . The expansion constant is the smallest real number such that . Let be the similarly defined constant for the metric space of line segments of , then set . Other constants are similarly defined in Beygelzimer et al. (2006).

Proof of Corollary 9. The distance from Definition 11 measured orthogonally to the straight line through fixed endpoints is not smaller than the Hausdorff distance used for -offsets in Definition 1. Hence the algorithm from Lemma 12 produces required -approximations in the sense of Definition 1. The current implementation uses the single-edge clustering based on , so Step 1a runs in time. The total time is dominated by Step 1b computing in time , where is the inverse Ackermann function.

Algorithm 1 in Step 1c has the pseudo-code above and maintains a binary tree on vertices, which requires time. Selecting deep vertices in Step 1d and finding longest paths in Step 1e within subtrees of needs time by classical algorithms Fredman and Tarjan (1987). Step 2a to split into subclouds is implemented by cover trees for line segments of in time as proved in Beygelzimer et al. (2006). By Lemma 12 Steps 2b-2g for approximating any subcloud of points by a polygonal path runs in time. Hence the total time at Stage 2 for computing over the cloud of points is . ∎

8 Appendix B: more qualitative comparisons of 3 algorithms

Fig. 1215 show example outputs of 3 algorithms on real and randomly generated clouds in . In almost all cases the Mapper and -Reeb graphs contain superfluous short edges, which affect the homeomorphism types.

The error factor from Corollary 9 affects the quality of approximation. Fig. 11 shows that higher values of lead to more straightened curves.

Figure 11: Left: for the branching factor . Middle: , Right: .
Figure 12: 1st: a sample around a 4-star in , 2nd: Mapper, 3rd: -Reeb, 4th: .
Figure 13: 1st: a sample around a 5-star in , 2nd: Mapper, 3rd: -Reeb, 4th: .
Figure 14: 1st: a sample around a 6-star in , 2nd: Mapper, 3rd: -Reeb, 4th: .
Figure 15: 1st: a sample around a 7-star in , 2nd: Mapper, 3rd: -Reeb, 4th: .
Figure 16: Explaining 1.87% failures of in Table 10: two micelles with short edges.