Query K-means Clustering and the Double Dixie Cup Problem

06/15/2018
by I Chien, et al.

We consider the problem of approximate K-means clustering with outliers and side information provided by same-cluster queries with possibly noisy answers. Our solution shows that, under some mild assumptions on the smallest cluster size, one can obtain a (1+ϵ)-approximation for the optimal potential with probability at least 1-δ, where ϵ>0 and δ∈(0,1), using an expected number of O(K^3/(ϵδ)) noiseless same-cluster queries and comparison-based clustering of complexity O(ndK + K^3/(ϵδ)). Here, n denotes the number of points and d the dimension of the space. Compared to a handful of other known approaches that perform importance sampling to account for small cluster sizes, the proposed query technique reduces the number of queries by a factor of roughly O(K^6/ϵ^3), at the cost of possibly missing very small clusters. We extend this setting to the case where some queries to the oracle produce erroneous information, and where certain points, termed outliers, do not belong to any cluster. Our proof techniques differ from previous methods used for K-means clustering analysis, as they rely on estimating the sizes of the clusters and the number of points needed for accurate centroid estimation, and on subsequent nontrivial generalizations of the double Dixie cup problem. We illustrate the performance of the proposed algorithm on both synthetic and real datasets, including MNIST and CIFAR 10.
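The link between same-cluster queries and the double Dixie cup problem can be illustrated with a small simulation. The following is a minimal sketch, not the authors' algorithm: it assumes a noiseless oracle, no outliers, and uniform sampling with replacement, and it keeps drawing points until every one of the K clusters has at least m sampled members (the double Dixie cup stopping rule). The function names, the parameter m, and the use of ground-truth `labels` to emulate the oracle are all hypothetical choices made for illustration.

```python
import random


def sample_until_m_per_cluster(labels, K, m, rng):
    """Draw points uniformly at random (with replacement) until each of the
    K clusters has at least m sampled points.

    Cluster membership is learned only through same-cluster queries against
    one representative of each cluster discovered so far.
    Assumes every one of the K clusters is non-empty in `labels`.
    """
    def same_cluster(i, j):
        # Hypothetical noiseless oracle, emulated with ground-truth labels.
        return labels[i] == labels[j]

    groups = []   # each group holds indices known to share a cluster
    draws = 0     # number of sampled points
    queries = 0   # number of oracle calls
    while len(groups) < K or min(len(g) for g in groups) < m:
        i = rng.randrange(len(labels))
        draws += 1
        for g in groups:
            queries += 1
            if same_cluster(i, g[0]):
                g.append(i)
                break
        else:
            # No existing representative matched: a new cluster was found.
            groups.append([i])
    return groups, draws, queries


def estimate_centroids(points, groups):
    """Average the sampled points of each group, coordinate by coordinate."""
    d = len(points[0])
    return [
        tuple(sum(points[i][t] for i in g) / len(g) for t in range(d))
        for g in groups
    ]
```

Each draw costs at most K queries, so the total query count is bounded by K times the number of draws. In this sketch the number of draws is exactly the waiting time of a double Dixie cup experiment with K coupon types, m copies of each, and coupon probabilities given by the relative cluster sizes.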


research
03/02/2018

Semi-Supervised Algorithms for Approximately Optimal and Accurate Clustering

We study k-means clustering in a semi-supervised setting. Given an oracl...
research
12/19/2017

Approximate Correlation Clustering Using Same-Cluster Queries

Ashtiani et al. (NIPS 2016) introduced a semi-supervised framework for c...
research
08/29/2023

Clustering Without an Eigengap

We study graph clustering in the Stochastic Block Model (SBM) in the pre...
research
10/27/2021

Learning-Augmented k-means Clustering

k-means clustering is a well-studied problem due to its wide applicabili...
research
08/17/2021

Learning to Cluster via Same-Cluster Queries

We study the problem of learning to cluster data points using an oracle ...
research
09/11/2017

Semi-Supervised Active Clustering with Weak Oracles

Semi-supervised active clustering (SSAC) utilizes the knowledge of a dom...
research
01/12/2017

Light Source Point Cluster Selection Based Atmosphere Light Estimation

Atmosphere light value is a highly critical parameter in defogging algor...
