On Efficient Low Distortion Ultrametric Embedding

by Vincent Cohen-Addad, et al.

A classic problem in unsupervised learning and data analysis is to find simpler, easy-to-visualize representations of the data that preserve its essential properties. A widely used way to preserve the underlying hierarchical structure of the data while reducing its complexity is to embed the data into a tree or an ultrametric. The most popular algorithms for this task are the classic linkage algorithms (single, average, or complete). However, on a data set of n points in Ω(log n) dimensions, these methods have a prohibitive running time of Θ(n^2). In this paper, we provide a new algorithm which takes as input a set of points P in ℝ^d and, for every c ≥ 1, runs in time n^{1+ρ/c^2} (for some universal constant ρ > 1) to output an ultrametric Δ such that, for any two points u, v in P, Δ(u,v) is within a multiplicative factor of 5c of the distance between u and v in the "best" ultrametric representation of P. Here, the best ultrametric is the ultrametric Δ̃ that minimizes the maximum distance distortion with respect to the ℓ_2 distance, namely the one that minimizes max_{u,v ∈ P} Δ̃(u,v)/‖u−v‖_2. We complement this result by showing that, under popular complexity-theoretic assumptions, for every constant ε > 0, no algorithm with running time n^{2−ε} can distinguish between inputs in the ℓ_∞ metric that admit an isometric embedding and those that incur a distortion of 3/2. Finally, we present an empirical evaluation on classic machine learning datasets and show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.
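To make the objective in the abstract concrete, the following is a minimal sketch, not the paper's algorithm: it checks whether a candidate distance matrix is an ultrametric (the strong triangle inequality) and evaluates the distortion max_{u,v ∈ P} Δ(u,v)/‖u−v‖_2. The function names `is_ultrametric` and `max_distortion`, the three sample points, and the candidate matrix are illustrative assumptions. The brute-force checks take cubic and quadratic time, so they are only suitable for small inputs.

```python
import itertools
import math


def is_ultrametric(delta, tol=1e-9):
    """Check delta[u][w] <= max(delta[u][v], delta[v][w]) for all triples."""
    n = len(delta)
    for u, v, w in itertools.permutations(range(n), 3):
        if delta[u][w] > max(delta[u][v], delta[v][w]) + tol:
            return False
    return True


def max_distortion(points, delta):
    """Return the maximum over pairs of delta(u, v) / ||u - v||_2."""
    worst = 0.0
    for u, v in itertools.combinations(range(len(points)), 2):
        worst = max(worst, delta[u][v] / math.dist(points[u], points[v]))
    return worst


# Hypothetical input: three collinear points in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]

# Hypothetical candidate ultrametric: points 0 and 1 join at height 1,
# and that cluster joins point 2 at height 5.
candidate = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 5.0],
    [5.0, 5.0, 0.0],
]

# The Euclidean distances themselves, which violate the strong triangle
# inequality on this input (5 > max(1, 4)).
euclidean = [[math.dist(p, q) for q in points] for p in points]

print(is_ultrametric(candidate))             # True
print(is_ultrametric(euclidean))             # False
print(max_distortion(points, candidate))     # 1.25
```

In this example the candidate never shrinks a distance and stretches the pair (1, 2) from 4 to 5, which gives the distortion of 1.25.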




