AutoER: Automated Entity Resolution using Generative Modelling

08/16/2019
by   Renzhi Wu, et al.

Entity resolution (ER) refers to the problem of identifying records in one or more relations that refer to the same real-world entity. ER has been extensively studied by the database community, with supervised machine learning approaches achieving state-of-the-art results. However, supervised ML requires many labeled examples, both matches and unmatches, which are expensive to obtain. In this paper, we investigate an important problem: how can we design an unsupervised algorithm for ER that achieves performance comparable to supervised approaches? We propose an automated ER solution, AutoER, that requires zero labeled examples. Our central insight is that the similarity vectors for matches should look different from those for unmatches. A number of innovations are needed to translate this intuition into an actual algorithm for ER. We advocate the use of generative models to capture the two similarity vector distributions (the match distribution and the unmatch distribution). We propose an expectation-maximization-based algorithm to learn the model parameters. Our algorithm addresses many practical challenges, including feature correlations, model overfitting, class imbalance, and transitivity between matches. On six datasets from four different domains, we show that the performance of AutoER is comparable to, and sometimes even better than, that of supervised ML approaches.
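
The core idea in the abstract (two similarity-vector distributions, fitted without labels by expectation maximization) can be illustrated with a small sketch. This is not AutoER itself: it assumes a plain two-component Gaussian mixture with diagonal covariance, so it ignores the feature correlations, overfitting controls, and transitivity handling the abstract mentions. The function name `fit_two_component_gmm`, the synthetic data, and the midpoint initialization are all hypothetical choices made for illustration.

```python
import numpy as np

def fit_two_component_gmm(X, n_iter=50, eps=1e-6, var_floor=1e-4):
    """EM for a 2-component diagonal Gaussian mixture (illustrative only).

    X: (n_pairs, n_features) similarity vectors.
    Returns (posterior of the "match" component per pair, component means).
    """
    n, _ = X.shape
    # Assumed initialization: pairs whose average similarity lies above the
    # midpoint of the observed range start out as tentative matches.
    score = X.mean(axis=1)
    resp = (score > 0.5 * (score.min() + score.max())).astype(float)

    for _ in range(n_iter):
        # M-step: re-estimate mixing weights, means, and variances.
        R = np.column_stack([1.0 - resp, resp])           # (n, 2)
        Nk = R.sum(axis=0) + eps                          # (2,)
        pi = Nk / Nk.sum()
        mu = (R.T @ X) / Nk[:, None]                      # (2, d)
        diff = X[:, None, :] - mu[None, :, :]             # (n, 2, d)
        var = (R[:, :, None] * diff**2).sum(axis=0) / Nk[:, None] + var_floor

        # E-step: posterior of each component, computed in log space.
        log_p = -0.5 * (np.log(2 * np.pi * var) + diff**2 / var).sum(axis=2)
        log_p += np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)         # numerical stability
        p = np.exp(log_p)
        resp = p[:, 1] / p.sum(axis=1)

    return resp, mu

# Synthetic, imbalanced data: 950 unmatched pairs and 50 matched pairs,
# each described by 3 similarity features in [0, 1].
rng = np.random.default_rng(0)
unmatches = np.clip(rng.normal(0.20, 0.10, size=(950, 3)), 0.0, 1.0)
matches = np.clip(rng.normal(0.85, 0.08, size=(50, 3)), 0.0, 1.0)
X = np.vstack([unmatches, matches])
truth = np.r_[np.zeros(950), np.ones(50)].astype(bool)

posterior, means = fit_two_component_gmm(X)
predicted = posterior > 0.5
print("accuracy on synthetic pairs:", (predicted == truth).mean())
```

The ground-truth labels are used only to score the result; the fit itself sees nothing but the similarity vectors. On real data the two distributions overlap far more than in this toy setup, which is where the practical challenges listed in the abstract come in.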

research
11/13/2022

Ground Truth Inference for Weakly Supervised Entity Matching

Entity matching (EM) refers to the problem of identifying pairs of data ...
research
12/18/2020

ErGAN: Generative Adversarial Networks for Entity Resolution

Entity resolution targets at identifying records that represent the same...
research
09/28/2018

Reuse and Adaptation for Entity Resolution through Transfer Learning

Entity resolution (ER) is one of the fundamental problems in data integr...
research
09/10/2015

Performance Bounds for Pairwise Entity Resolution

One significant challenge to scaling entity resolution algorithms to mas...
research
08/18/2021

CollaborER: A Self-supervised Entity Resolution Framework Using Multi-features Collaboration

Entity Resolution (ER) aims to identify whether two tuples refer to the ...
research
07/08/2022

Sudowoodo: Contrastive Self-supervised Learning for Multi-purpose Data Integration and Preparation

Machine learning (ML) is playing an increasingly important role in data ...
research
03/11/2018

Entity Resolution and Federated Learning get a Federated Resolution

Consider two data providers, each maintaining records of different featu...
