On Efficient Low Distortion Ultrametric Embedding

08/15/2020
by Vincent Cohen-Addad, et al.

A classic problem in unsupervised learning and data analysis is to find simpler, easy-to-visualize representations of the data that preserve its essential properties. A widely used method to preserve the underlying hierarchical structure of the data while reducing its complexity is to find an embedding of the data into a tree or an ultrametric. The most popular algorithms for this task are the classic linkage algorithms (single, average, or complete). However, on a data set of n points in Ω(log n) dimensions, these methods exhibit a prohibitive running time of Θ(n^2). In this paper, we provide a new algorithm which takes as input a set of points P in ℝ^d and, for every c ≥ 1, runs in time n^{1+ρ/c^2} (for some universal constant ρ > 1) to output an ultrametric Δ such that, for any two points u, v in P, Δ(u,v) is within a multiplicative factor of 5c of the distance between u and v in the "best" ultrametric representation of P. Here, the best ultrametric is the ultrametric Δ̃ that minimizes the maximum distance distortion with respect to the ℓ_2 distance, namely the one that minimizes max_{u,v ∈ P} Δ̃(u,v)/‖u−v‖_2. We complement the above result by showing that, under popular complexity-theoretic assumptions, for every constant ε > 0, no algorithm with running time n^{2−ε} can distinguish between inputs in the ℓ_∞-metric that admit an isometric embedding and those that incur a distortion of 3/2. Finally, we present an empirical evaluation on classic machine learning datasets and show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.
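To make the two quantities in the abstract concrete, the following minimal Python sketch checks the ultrametric (strong triangle) inequality Δ(u,w) ≤ max(Δ(u,v), Δ(v,w)) and evaluates the distortion objective max_{u,v ∈ P} Δ(u,v)/‖u−v‖_2 for a candidate Δ. It is not the paper's algorithm: the function names `is_ultrametric` and `max_distortion`, the brute-force loops, and the three-point toy input are assumptions chosen only for illustration, and the points are assumed to be pairwise distinct so that no ℓ_2 distance is zero.

```python
import itertools
import math


def is_ultrametric(delta, tol=1e-12):
    """Check delta[u][w] <= max(delta[u][v], delta[v][w]) for all ordered triples."""
    n = len(delta)
    for u, v, w in itertools.permutations(range(n), 3):
        if delta[u][w] > max(delta[u][v], delta[v][w]) + tol:
            return False
    return True


def max_distortion(points, delta):
    """Return max over pairs of delta[u][v] / ||u - v||_2 (points assumed distinct)."""
    worst = 0.0
    for u, v in itertools.combinations(range(len(points)), 2):
        dist = math.dist(points[u], points[v])  # Euclidean (l_2) distance
        worst = max(worst, delta[u][v] / dist)
    return worst


# Hypothetical toy input: three collinear points and a hand-built candidate ultrametric.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
delta = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 5.0],
    [5.0, 5.0, 0.0],
]

# The Euclidean distance matrix itself, for comparison.
euclid = [[math.dist(p, q) for q in points] for p in points]

print(is_ultrametric(delta))
print(is_ultrametric(euclid))
print(max_distortion(points, delta))
```

On this toy input the Euclidean distances violate the strong triangle inequality (5 > max(1, 4)), while the candidate Δ satisfies it at the cost of stretching one pair from 4 to 5. The brute-force checks run in cubic and quadratic time, so they are suitable only for verifying small examples, not for the large inputs the paper targets.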

research
06/03/2023

On the Budgeted Hausdorff Distance Problem

Given a set P of n points in the plane, and a parameter k, we present...
research
01/04/2022

Polyline Simplification under the Local Fréchet Distance has Subcubic Complexity in 2D

Given a polyline on n vertices, the polyline simplification problem asks...
research
06/10/2021

Hierarchical Agglomerative Graph Clustering in Nearly-Linear Time

We study the widely used hierarchical agglomerative clustering (HAC) alg...
research
04/16/2020

Coresets for Clustering in Excluded-minor Graphs and Beyond

Coresets are modern data-reduction tools that are widely used in data an...
research
03/14/2021

Faster Algorithms for Largest Empty Rectangles and Boxes

We revisit a classical problem in computational geometry: finding the la...
research
11/07/2020

A Gap-ETH-Tight Approximation Scheme for Euclidean TSP

We revisit the classic task of finding the shortest tour of n points in ...
research
12/12/2021

Maintaining AUC and H-measure over time

Measuring the performance of a classifier is a vital task in machine lea...
