Correlation Clustering with Adaptive Similarity Queries

05/28/2019
by   Marco Bressan, et al.
0

We investigate learning algorithms that use similarity queries to approximately solve correlation clustering problems. The input consists of n objects; each pair of objects has a hidden binary similarity score that we can learn through a query. The goal is to use as few queries as possible to partition the objects into clusters so to achieve the optimal number OPT of disagreements with the scores. Our first set of contributions is algorithmic: we introduce ACC, a simple query-aware variant of an existing algorithm (KwikCluster, with expected error 3OPT but a vacuous O(n^2) worst-case bound on the number of queries) for which we prove several desirable properties. First, ACC has expected error 3OPT + O(n^3/Q) when using Q < n2 queries, and recovers KwikCluster's bound of 3OPT for Q=n2. Second, ACC accurately recovers every adversarially perturbed latent cluster C. Under stronger conditions on C, ACC can even be used to recover exactly all clusters with high probability. Third, we show an efficient variant, , with the same expected error as ACC but using significantly less queries on some graphs. We empirically test our algorithms on real-world and synthetic datasets. Our second set of contributions is a nearly complete information-theoretic characterization of the query vs. error trade-off. First, using VC theory, for all Q = Ω(n) we prove the existence of algorithms with expected error at most OPT+ n^5/2/√(Q), and at most O(n^3/Q) if OPT=0. We then show that any randomized algorithm, when using at most Q queries, must output a clustering with expected cost OPT+ Ω(n^3/Q), which matches the upper bound for Q=Θ(n). For the special case of OPT=0 we prove a weaker lower bound of Ω(n^2/√(Q)).

READ FULL TEXT

page 1

page 2

page 3

page 4

research
02/26/2020

Query-Efficient Correlation Clustering

Correlation clustering is arguably the most natural formulation of clust...
research
10/28/2019

Same-Cluster Querying for Overlapping Clusters

Overlapping clusters are common in models of many practical data-segment...
research
05/22/2023

Error-Tolerant Exact Query Learning of Finite Set Partitions with Same-Cluster Oracle

This paper initiates the study of active learning for exact recovery of ...
research
06/08/2016

Clustering with Same-Cluster Queries

We propose a framework for Semi-Supervised Active Clustering framework (...
research
06/30/2011

A Comprehensive Trainable Error Model for Sung Music Queries

We propose a model for errors in sung queries, a variant of the hidden M...
research
11/18/2018

Information Theoretic Bound on Optimal Worst-case Error in Binary Mixture Identification

Identification of latent binary sequences from a pool of noisy observati...
research
08/10/2021

Correlation Clustering Reconstruction in Semi-Adversarial Models

Correlation Clustering is an important clustering problem with many appl...

Please sign up or login with your details

Forgot password? Click here to reset