Clustering Permutations: New Techniques with Streaming Applications

12/04/2022
by   Diptarka Chakraborty, et al.

We study the classical metric k-median clustering problem over a set of input rankings (i.e., permutations), which has myriad applications, from social-choice theory to web search and databases. A folklore algorithm provides a 2-approximate solution in polynomial time for all k = O(1), and works irrespective of the underlying distance measure, so long as it is a metric; however, going below the factor of 2 is a notorious challenge. We consider the Ulam distance, a variant of the well-known edit-distance metric in which strings are restricted to be permutations. For this metric, Chakraborty, Das, and Krauthgamer [SODA, 2021] provided a (2-δ)-approximation algorithm for k = 1, where δ ≈ 2^{-40}. Our primary contribution is a new algorithmic framework for clustering a set of permutations. Our first result is a 1.999-approximation algorithm for the metric k-median problem under the Ulam metric that runs in time (k log(nd))^{O(k)} · n d^3 for an input consisting of n permutations over [d]. In fact, our framework is powerful enough to extend this result to the streaming model (where the n input permutations arrive one by one) using only polylogarithmic (in n) space. Additionally, we show that similar results can be obtained even in the presence of outliers, which is presumably a more difficult problem.
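To make the baseline concrete, here is a minimal Python sketch of the folklore 2-approximation that the abstract mentions. It is not the paper's 1.999-approximation framework. Two assumptions are made: the Ulam distance is taken to be the minimum number of character moves, which equals d minus the length of the longest common subsequence of the two permutations, and the folklore algorithm is read as exhaustively trying every choice of k input permutations as centers. The function names `ulam_distance` and `folklore_k_median` are illustrative only.

```python
from bisect import bisect_left
from itertools import combinations


def ulam_distance(p, q):
    """Ulam distance between two permutations of the same symbols.

    Assumes the character-move definition: d minus the length of the
    longest common subsequence (LCS). For permutations, the LCS reduces
    to a longest increasing subsequence (LIS) after relabeling.
    """
    pos = {v: i for i, v in enumerate(p)}   # position of each symbol in p
    seq = [pos[v] for v in q]               # q rewritten in p's coordinates
    tails = []                              # patience-sorting LIS, O(d log d)
    for x in seq:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(p) - len(tails)


def folklore_k_median(perms, k):
    """Try every k-subset of the inputs as centers; keep the cheapest.

    Restricting centers to input points loses at most a factor of 2 in
    any metric. The running time is roughly n^k, so this is only
    polynomial for constant k.
    """
    n = len(perms)
    dist = [[ulam_distance(a, b) for b in perms] for a in perms]
    best_cost, best_centers = None, None
    for centers in combinations(range(n), k):
        # Each permutation pays its distance to the nearest chosen center.
        cost = sum(min(dist[i][c] for c in centers) for i in range(n))
        if best_cost is None or cost < best_cost:
            best_cost, best_centers = cost, centers
    return best_cost, [perms[c] for c in best_centers]


perms = [(1, 2, 3, 4), (1, 2, 3, 4), (2, 1, 3, 4), (4, 3, 2, 1)]
cost, centers = folklore_k_median(perms, k=2)
```

Only the distance function is specific to permutations; the subset search would work unchanged with any other metric, which matches the abstract's remark that the folklore algorithm is independent of the underlying distance measure.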


research
04/01/2020

k-Median clustering under discrete Fréchet and Hausdorff distances

We give the first near-linear time (1+ε)-approximation algorithm for k-me...
research
12/19/2018

Approximation Schemes for Capacitated Clustering in Doubling Metrics

Motivated by applications in redistricting, we consider the uniform capa...
research
04/19/2021

Coresets for (k, ℓ)-Median Clustering under the Fréchet Distance

We present an algorithm for computing ϵ-coresets for (k, ℓ)-median clust...
research
11/02/2020

Approximating the Median under the Ulam Metric

We study approximation algorithms for variants of the median string prob...
research
07/17/2019

Improved Algorithms for Time Decay Streams

In the time-decay model for data streams, elements of an underlying data...
research
04/29/2019

Soft edit distance for differentiable comparison of symbolic sequences

Edit distance, also known as Levenshtein distance, is an essential way t...
research
03/21/2018

Similar Elements and Metric Labeling on Complete Graphs

We consider a problem that involves finding similar elements in a collec...
