
Correlation Clustering with Same-Cluster Queries Bounded by Optimal Cost
Several clustering frameworks with interactive (semi-supervised) queries...

Exact Recovery of Mangled Clusters with Same-Cluster Queries
We study the problem of recovering distorted clusters in the semisuperv...

Bipartite Correlation Clustering -- Maximizing Agreements
In Bipartite Correlation Clustering (BCC) we are given a complete bipart...

Who and Where: People and Location Co-Clustering
In this paper, we consider the clustering problem on images where each i...

Relaxed Oracles for Semi-Supervised Clustering
Pairwise "same-cluster" queries are one of the most widely used forms of...

Improved algorithms for Correlation Clustering with local objectives
Correlation Clustering is a powerful graph partitioning model that aims ...

A Prior for Record Linkage Based on Allelic Partitions
In database management, record linkage aims to identify multiple records...
Semi-supervised clustering for deduplication
Data deduplication is the task of detecting multiple records that correspond to the same real-world entity in a database. In this work, we view deduplication as a clustering problem where the goal is to put records corresponding to the same physical entity in the same cluster and records corresponding to different physical entities in different clusters. We introduce a framework which we call promise correlation clustering. Given a complete graph G with the edges labelled 0 and 1, the goal is to find a clustering that minimizes the number of 0 edges within a cluster plus the number of 1 edges across different clusters (the correlation loss). The optimal clustering can also be viewed as a complete graph G^* in which edges between points in the same cluster are labelled 1 and all other edges are labelled 0. Under the promise that the edge difference between G and G^* is "small", we prove that finding the optimal clustering (or G^*) is still NP-hard. [Ashtiani et al., 2016] introduced the framework of semi-supervised clustering, where the learning algorithm has access to an oracle that answers whether two points belong to the same or different clusters. We further prove that even with access to a same-cluster oracle, the promise version is NP-hard as long as the number of queries to the oracle is not too large (o(n), where n is the number of vertices). Given these negative results, we consider a restricted version of correlation clustering. As before, the goal is to find a clustering that minimizes the correlation loss. However, we restrict ourselves to a given class F of clusterings. We offer a semi-supervised algorithmic approach to solve the restricted variant with success guarantees.
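The correlation loss defined in the abstract can be illustrated with a short Python sketch. This is not the authors' code. The function names (`correlation_loss`, `clustering_graph`, `edge_difference`) and the four-vertex example are hypothetical, chosen only for illustration. The sketch assumes that label 1 marks a pair that should share a cluster, which is the convention the loss definition implies.

```python
from itertools import combinations


def correlation_loss(labels, clustering):
    """Count 0-edges inside clusters plus 1-edges across clusters.

    labels:     dict mapping frozenset({u, v}) -> 0 or 1 (edge labels of G)
    clustering: dict mapping each vertex -> cluster id
    """
    loss = 0
    for u, v in combinations(sorted(clustering), 2):
        same_cluster = clustering[u] == clustering[v]
        label = labels[frozenset((u, v))]
        if same_cluster and label == 0:
            loss += 1  # 0-edge within a cluster
        elif not same_cluster and label == 1:
            loss += 1  # 1-edge across different clusters
    return loss


def clustering_graph(clustering):
    """View a clustering as a labelled complete graph.

    Assumption: same-cluster pairs are labelled 1, all other pairs 0.
    """
    return {
        frozenset((u, v)): int(clustering[u] == clustering[v])
        for u, v in combinations(sorted(clustering), 2)
    }


def edge_difference(labels_a, labels_b):
    """Number of edges on which two labellings of the same graph disagree."""
    return sum(labels_a[edge] != labels_b[edge] for edge in labels_a)


# Hypothetical toy instance: four records, three pairs labelled 1.
vertices = ["a", "b", "c", "d"]
ones = {frozenset("ab"), frozenset("cd"), frozenset("ac")}
labels = {
    frozenset(pair): int(frozenset(pair) in ones)
    for pair in combinations(vertices, 2)
}

clustering = {"a": 0, "b": 0, "c": 1, "d": 1}
print(correlation_loss(labels, clustering))
print(edge_difference(labels, clustering_graph(clustering)))
```

In this toy instance the only disagreement is the pair (a, c), which is labelled 1 but split across clusters, so both printed values are 1. Under the stated labelling convention, the correlation loss of a clustering equals its edge difference from G, which is the quantity the promise in the abstract bounds.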