Graph-based hierarchical record clustering for unsupervised entity resolution

12/12/2021
by Islam Akef Ebeid, et al.

We study matched record clustering in unsupervised entity resolution, building on a state-of-the-art probabilistic framework, the Data Washing Machine (DWM). We introduce a graph-based, hierarchical, two-step record clustering method (GDWM). The first step identifies large connected components, which we call soft clusters, among the matched record pairs using the graph-based transitive closure algorithm already used in the DWM. The second step hierarchically breaks these soft clusters down into more precise entity clusters using an adapted graph-based modularity optimization method. Compared with the original DWM implementation, our approach offers a significant speed-up, higher precision, and higher overall F1 scores. We demonstrate its efficacy in experiments on multiple synthetic datasets. Our results also provide evidence of the utility of graph theory-based algorithms, despite their scarcity in the literature on unsupervised entity resolution.
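The two-step idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' GDWM implementation. Step 1 computes connected components of the matched pairs with a union-find structure, and step 2 splits each component with a plain greedy modularity merge, which stands in for the paper's adapted modularity optimization method. All function names (`connected_components`, `greedy_modularity_split`, `resolve`) and the toy record IDs are hypothetical, and the sketch assumes unweighted matched pairs between distinct records.

```python
from collections import defaultdict


def connected_components(pairs):
    """Step 1 (sketch): transitive closure of matched pairs via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def greedy_modularity_split(nodes, edges):
    """Step 2 (sketch): greedy modularity merging inside one soft cluster.

    Stand-in for the paper's adapted method: start from singletons and
    repeatedly merge the community pair with the largest positive gain
    dQ = e_ij / m - d_i * d_j / (2 * m^2).
    """
    m = len(edges)
    if m == 0:
        return [{n} for n in nodes]
    comm = {n: n for n in nodes}   # node -> community label
    degree = defaultdict(int)      # community label -> total degree
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    while True:
        between = defaultdict(int)  # edge counts between community pairs
        for a, b in edges:
            ca, cb = comm[a], comm[b]
            if ca != cb:
                between[tuple(sorted((ca, cb)))] += 1
        best_gain, best_pair = 0.0, None
        for (ca, cb), e in between.items():
            gain = e / m - degree[ca] * degree[cb] / (2 * m * m)
            if gain > best_gain:
                best_gain, best_pair = gain, (ca, cb)
        if best_pair is None:       # no merge improves modularity
            break
        keep, drop = best_pair
        for n in nodes:
            if comm[n] == drop:
                comm[n] = keep
        degree[keep] += degree.pop(drop)
    groups = defaultdict(set)
    for n in nodes:
        groups[comm[n]].add(n)
    return list(groups.values())


def resolve(pairs):
    """Soft clusters first, then a modularity-based split of each one."""
    edges = sorted({tuple(sorted(p)) for p in pairs if p[0] != p[1]})
    clusters = []
    for component in connected_components(edges):
        inner = [e for e in edges if e[0] in component]
        clusters.extend(greedy_modularity_split(component, inner))
    return clusters


# Toy matched pairs: two triangles joined by one weak link, plus one pair.
pairs = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d"), ("g", "h")]
for cluster in resolve(pairs):
    print(sorted(cluster))
```

In this toy input, transitive closure alone places records `a` through `f` in a single soft cluster because of the one `c`-`d` link. The modularity step separates the two densely linked triangles, which is the kind of precision gain the abstract describes.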


research
09/06/2019

Graph-based data clustering via multiscale community detection

We present a graph-theoretical approach to data clustering, which combin...
research
05/21/2019

Clustering with Similarity Preserving

Graph-based clustering has shown promising performance in many tasks. A ...
research
06/05/2018

Hierarchical Graph Clustering using Node Pair Sampling

We present a novel hierarchical graph clustering algorithm inspired by m...
research
07/12/2012

A Hierarchical Graphical Model for Record Linkage

The task of matching co-referent records is known among other names as r...
research
03/12/2019

Learning Resolution Parameters for Graph Clustering

Finding clusters of well-connected nodes in a graph is an extensively st...
research
08/10/2020

(Almost) All of Entity Resolution

Whether the goal is to estimate the number of people that live in a cong...
research
06/04/2016

Improving Coreference Resolution by Learning Entity-Level Distributed Representations

A long-standing challenge in coreference resolution has been the incorpo...
