Semi-Supervised Algorithms for Approximately Optimal and Accurate Clustering

by Buddhima Gamlath, et al.

We study k-means clustering in a semi-supervised setting. Given an oracle that returns whether two given points belong to the same cluster in a fixed optimal clustering, we investigate the following question: how many oracle queries are sufficient to efficiently recover a clustering that, with probability at least (1 - δ), simultaneously has a cost of at most (1 + ϵ) times the optimal cost and an accuracy of at least (1 - ϵ)? We show how to achieve such a clustering on n points with O((k^2 log n) · m(Q, ϵ^4, δ/(k log n))) oracle queries, when the k clusters can be learned with an error of ϵ' and a failure probability of δ' using m(Q, ϵ', δ') labeled samples, where Q is the set of candidate cluster centers. We show that m(Q, ϵ', δ') is small both for k-means instances in Euclidean space and for those in finite metric spaces. We further show that, for Euclidean k-means instances, we can avoid the dependency on n in the query complexity at the expense of an increased dependency on k: specifically, we give a slightly more involved algorithm that uses O(k^4/(ϵ^2 δ) + (k^9/ϵ^4) log(1/δ) + k · m(Q, ϵ^4/k, δ)) oracle queries. Finally, we show that the number of queries required for (1 - ϵ)-accuracy in Euclidean k-means must depend linearly on the dimension of the underlying Euclidean space, whereas, for k-means in finite metric spaces, this number must be at least logarithmic in the number of candidate centers. This shows that our query complexities capture the right dependencies on the respective parameters.
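To make the query model concrete, here is a minimal Python sketch of a same-cluster oracle and a simple way to label a sample with it. It is not the algorithm from the paper: it only illustrates the standard observation that a point can be assigned to its cluster with at most k queries by comparing it against one representative per cluster found so far. The names `make_oracle`, `label_with_queries`, and `estimate_centers` are hypothetical, and the oracle is simulated from known ground-truth labels, which a real instance would not have.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Oracle = Callable[[int, int], bool]


def make_oracle(true_labels: Sequence[int]) -> Oracle:
    """Simulate a same-cluster oracle from ground-truth labels (assumption:
    in practice the oracle is a domain expert, not a lookup table)."""
    def same_cluster(i: int, j: int) -> bool:
        return true_labels[i] == true_labels[j]
    return same_cluster


def label_with_queries(
    indices: Sequence[int], same_cluster: Oracle, k: int
) -> Tuple[Dict[int, int], int]:
    """Assign each sampled index to a cluster id using at most k queries per point.

    Each point is compared against one representative of every cluster
    discovered so far; if none matches, the point starts a new cluster.
    Returns the assignment and the number of oracle queries used.
    """
    representatives: List[int] = []
    assignment: Dict[int, int] = {}
    queries = 0
    for i in indices:
        for cluster_id, rep in enumerate(representatives):
            queries += 1
            if same_cluster(i, rep):
                assignment[i] = cluster_id
                break
        else:
            # No existing representative matched: open a new cluster.
            if len(representatives) >= k:
                raise ValueError("oracle answers imply more than k clusters")
            representatives.append(i)
            assignment[i] = len(representatives) - 1
    return assignment, queries


def estimate_centers(
    points: Sequence[Point], assignment: Dict[int, int], num_clusters: int
) -> List[Point]:
    """Estimate each center as the mean of the labeled points in that cluster."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(num_clusters)]
    counts = [0] * num_clusters
    for i, c in assignment.items():
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += points[i][d]
    return [
        tuple(s / counts[c] for s in sums[c]) if counts[c] else tuple(sums[c])
        for c in range(num_clusters)
    ]


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0), (5.0, 9.0), (0.0, 4.0)]
oracle = make_oracle([0, 0, 1, 1, 2, 0])
assignment, queries = label_with_queries(range(len(points)), oracle, k=3)
centers = estimate_centers(points, assignment, num_clusters=3)
```

Labeling all n points this way costs up to k · n queries. The bounds in the abstract instead grow with log n or are independent of n, which suggests that only a sample is labeled and the remaining points are assigned using the learned centers.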




Semi-Supervised Active Clustering with Weak Oracles

Semi-supervised active clustering (SSAC) utilizes the knowledge of a dom...

Query K-means Clustering and the Double Dixie Cup Problem

We consider the problem of approximate K-means clustering with outliers ...

Relaxed Oracles for Semi-Supervised Clustering

Pairwise "same-cluster" queries are one of the most widely used forms of...

Exact Recovery of Mangled Clusters with Same-Cluster Queries

We study the problem of recovering distorted clusters in the semi-superv...

On Margin-Based Cluster Recovery with Oracle Queries

We study an active cluster recovery problem where, given a set of n poin...

Improved Learning-augmented Algorithms for k-means and k-medians Clustering

We consider the problem of clustering in the learning-augmented setting,...

Fuzzy Clustering with Similarity Queries

The fuzzy or soft k-means objective is a popular generalization of the w...
