Semi-supervised clustering for de-duplication

10/10/2018
by Shrinu Kushagra, et al.

Data de-duplication is the task of detecting multiple records that correspond to the same real-world entity in a database. In this work, we view de-duplication as a clustering problem where the goal is to put records corresponding to the same physical entity in the same cluster and records corresponding to different physical entities in different clusters. We introduce a framework which we call promise correlation clustering. Given a complete graph G with the edges labelled 0 and 1, the goal is to find a clustering that minimizes the correlation loss, that is, the number of 0 edges within a cluster plus the number of 1 edges across different clusters. The optimal clustering can also be viewed as a complete graph G^* in which edges between points in the same cluster are labelled 1 and all other edges are labelled 0. Under the promise that the edge difference between G and G^* is "small", we prove that finding the optimal clustering (or G^*) is still NP-Hard. [Ashtiani et al., 2016] introduced the framework of semi-supervised clustering, where the learning algorithm has access to an oracle that answers whether two points belong to the same or different clusters. We further prove that even with access to a same-cluster oracle, the promise version is NP-Hard as long as the number of queries to the oracle is not too large (o(n), where n is the number of vertices). Given these negative results, we consider a restricted version of correlation clustering. As before, the goal is to find a clustering that minimizes the correlation loss. However, we restrict ourselves to a given class F of clusterings. We offer a semi-supervised algorithmic approach to solve the restricted variant with success guarantees.
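
To make the correlation loss and the restricted variant concrete, here is a minimal Python sketch. It is not the paper's algorithm: the paper's approach is semi-supervised and uses same-cluster queries, whereas this sketch evaluates the loss by brute force over a small, explicitly listed class F. The names `correlation_loss`, `best_in_class`, `labels`, and `candidates`, as well as the four-record toy graph, are illustrative assumptions and do not come from the paper.

```python
from itertools import combinations


def correlation_loss(labels, clustering):
    """Count 0-edges inside clusters plus 1-edges across clusters.

    labels: dict mapping frozenset({u, v}) to 0 or 1 for every pair (assumed layout).
    clustering: dict mapping each vertex to a cluster id.
    """
    loss = 0
    for u, v in combinations(sorted(clustering), 2):
        same_cluster = clustering[u] == clustering[v]
        edge = labels[frozenset((u, v))]
        if same_cluster and edge == 0:
            loss += 1  # records placed together although the edge says "different"
        elif not same_cluster and edge == 1:
            loss += 1  # records separated although the edge says "same"
    return loss


def best_in_class(labels, candidates):
    """Restricted variant, brute force: the lowest-loss clustering in a finite class F."""
    return min(candidates, key=lambda c: correlation_loss(labels, c))


# Hypothetical toy graph on four records; the edge b-c is a noisy 1 label.
labels = {
    frozenset(("a", "b")): 1,
    frozenset(("c", "d")): 1,
    frozenset(("a", "c")): 0,
    frozenset(("a", "d")): 0,
    frozenset(("b", "c")): 1,
    frozenset(("b", "d")): 0,
}

# A small explicit class F of candidate clusterings (assumed for illustration).
candidates = [
    {"a": 0, "b": 0, "c": 1, "d": 1},  # {a, b}, {c, d}
    {"a": 0, "b": 0, "c": 0, "d": 0},  # everything in one cluster
    {"a": 0, "b": 1, "c": 2, "d": 3},  # all singletons
]

best = best_in_class(labels, candidates)
```

On this toy input the two-cluster candidate disagrees with the labels only on the noisy b-c edge, so it has the lowest loss in the class. Enumerating F explicitly is only workable for tiny classes, and it is used here purely to illustrate the objective.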

research
08/14/2019

Correlation Clustering with Same-Cluster Queries Bounded by Optimal Cost

Several clustering frameworks with interactive (semi-supervised) queries...
research
06/08/2020

Exact Recovery of Mangled Clusters with Same-Cluster Queries

We study the problem of recovering distorted clusters in the semi-superv...
research
12/19/2017

Approximate Correlation Clustering Using Same-Cluster Queries

Ashtiani et al. (NIPS 2016) introduced a semi-supervised framework for c...
research
01/10/2019

An MBO scheme for clustering and semi-supervised clustering of signed networks

We introduce a principled method for the signed clustering problem, wher...
research
03/09/2016

Bipartite Correlation Clustering -- Maximizing Agreements

In Bipartite Correlation Clustering (BCC) we are given a complete bipart...
research
07/31/2013

Who and Where: People and Location Co-Clustering

In this paper, we consider the clustering problem on images where each i...
research
11/20/2017

Relaxed Oracles for Semi-Supervised Clustering

Pairwise "same-cluster" queries are one of the most widely used forms of...
