Scalable K-Medoids via True Error Bound and Familywise Bandits

05/27/2019
by   Aravindakshan Babu, et al.

K-Medoids (KM) is a standard clustering method, used extensively on semi-metric data. Error analyses of KM have traditionally used an in-sample notion of error, which can be far from the true error because it ignores generalization error. We formalize the true K-Medoid error based on the underlying data distribution by decomposing it into two fundamental statistical problems: minimum estimation (ME) and minimum mean estimation (MME). We provide a convergence result for MME and bound the true KM error for iid data. Inspired by this bound, we propose a computationally efficient, distributed KM algorithm, MCPAM. MCPAM has expected runtime O(km) and provides massive computational savings for a small tradeoff in accuracy. We verify the quality and scaling properties of MCPAM on various datasets, and achieve the hitherto unachieved feat of calculating the KM of 1 billion points on semi-metric spaces.
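
To make the minimum mean estimation idea concrete, the following is a minimal sketch for the single-medoid case (k = 1): each candidate's mean distance is estimated from a random subsample of reference points, and the candidate with the smallest estimate is returned. This is not the MCPAM algorithm from the paper. The function names, the sample size of 50, and the one-dimensional Gaussian data are all assumptions chosen for illustration.

```python
import random


def mean_distance(candidate, references, dist):
    """Average distance from one candidate to a set of reference points."""
    return sum(dist(candidate, r) for r in references) / len(references)


def exact_medoid(points, dist):
    """In-sample medoid: scores every point against every point (quadratic cost)."""
    return min(points, key=lambda p: mean_distance(p, points, dist))


def sampled_medoid(points, dist, n_refs, rng):
    """Hypothetical sampled variant: estimate each mean from n_refs random references."""
    refs = rng.sample(points, n_refs)
    return min(points, key=lambda p: mean_distance(p, refs, dist))


def abs_dist(a, b):
    """Distance on the real line, used here only as a stand-in dissimilarity."""
    return abs(a - b)


rng = random.Random(0)
points = [rng.gauss(0.0, 1.0) for _ in range(500)]  # illustrative iid data

exact = exact_medoid(points, abs_dist)
approx = sampled_medoid(points, abs_dist, n_refs=50, rng=rng)

# Compare both choices on the full sample; the exact medoid is optimal by construction.
exact_cost = mean_distance(exact, points, abs_dist)
approx_cost = mean_distance(approx, points, abs_dist)
print(f"exact cost: {exact_cost:.4f}, sampled cost: {approx_cost:.4f}")
```

The sampled variant evaluates 500 x 50 distances instead of 500 x 500, and the gap between the two printed costs is the accuracy given up for that saving on this particular sample.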

