Clustering with Noisy Queries

06/22/2017
by   Arya Mazumdar, et al.
0

In this paper, we initiate a rigorous theoretical study of clustering with noisy queries (or a faulty oracle). Given a set of n elements, our goal is to recover the true clustering by asking minimum number of pairwise queries to an oracle. Oracle can answer queries of the form : "do elements u and v belong to the same cluster?" -- the queries can be asked interactively (adaptive queries), or non-adaptively up-front, but its answer can be erroneous with probability p. In this paper, we provide the first information theoretic lower bound on the number of queries for clustering with noisy oracle in both situations. We design novel algorithms that closely match this query complexity lower bound, even when the number of clusters is unknown. Moreover, we design computationally efficient algorithms both for the adaptive and non-adaptive settings. The problem captures/generalizes multiple application scenarios. It is directly motivated by the growing body of work that use crowdsourcing for entity resolution, a fundamental and challenging data mining task aimed to identify all records in a database referring to the same entity. Here crowd represents the noisy oracle, and the number of queries directly relates to the cost of crowdsourcing. Another application comes from the problem of sign edge prediction in social network, where social interactions can be both positive and negative, and one must identify the sign of all pair-wise interactions by querying a few pairs. Furthermore, clustering with noisy oracle is intimately connected to correlation clustering, leading to improvement therein. Finally, it introduces a new direction of study in the popular stochastic block model where one has an incomplete stochastic block model matrix to recover the clusters.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/23/2017

Query Complexity of Clustering with Side Information

Suppose, we are given a set of n elements to be clustered into k (unknow...
research
06/18/2021

Towards a Query-Optimal and Time-Efficient Algorithm for Clustering with a Faulty Oracle

Motivated by applications in crowdsourced entity resolution in database,...
research
03/31/2019

Semisupervised Clustering by Queries and Locally Encodable Source Coding

Source coding is the canonical problem of data compression in informatio...
research
02/03/2017

A Theoretical Analysis of First Heuristics of Crowdsourced Entity Resolution

Entity resolution (ER) is the task of identifying all records in a datab...
research
09/19/2017

Predicting Positive and Negative Links with Noisy Queries: Theory & Practice

Social networks and interactions in social media involve both positive a...
research
09/21/2019

Optimal Learning of Joint Alignments with a Faulty Oracle

We consider the following problem, which is useful in applications such ...
research
06/09/2022

Clustering with Queries under Semi-Random Noise

The seminal paper by Mazumdar and Saha <cit.> introduced an extensive li...

Please sign up or login with your details

Forgot password? Click here to reset