Posterior Prototyping: Bridging the Gap between Bayesian Record Linkage and Regression

10/02/2018
by Andee Kaplan, et al.

Record linkage (entity resolution or de-duplication) is the process of merging noisy databases to remove duplicate entities. While record linkage removes duplicate entities from the data, many researchers are interested in performing inference, prediction, or post-linkage analysis on the linked data, which we call the downstream task. Depending on the downstream task, one may wish to find the most representative record before performing the post-linkage analysis. Motivated by the downstream task, we propose first performing record linkage using a Bayesian model and then choosing representative records through prototyping. Given the representative records, we then explore two downstream tasks: linear regression and binary classification via logistic regression. In addition, we explore how error propagates in both of these settings. We provide thorough empirical studies of our proposed methodology and conclude with a discussion of practical insights from our work.
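
The pipeline described above (link records, pick one representative per linked entity, then fit a regression) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the linkage step is replaced by a fixed vector of cluster labels standing in for a point estimate of the linkage, the representative is chosen by a simple minimax-distance rule, and the hypothetical helpers `minimax_prototypes` and `fit_ols` are not the authors' implementation.

```python
import numpy as np


def minimax_prototypes(X, labels):
    """Pick one row index per cluster (hypothetical helper).

    Assumed rule: within each cluster, choose the record whose largest
    Euclidean distance to the other records in that cluster is smallest.
    """
    reps = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        pts = X[idx]
        # Pairwise Euclidean distances between records in this cluster.
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        # Ties resolve to the first record in the cluster.
        reps.append(idx[np.argmin(d.max(axis=1))])
    return np.array(reps)


def fit_ols(X, y):
    """Ordinary least squares with an intercept (hypothetical helper)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef  # [intercept, slope_1, ...]


# Toy data: six noisy records assumed to refer to three entities.
X = np.array([[1.0], [1.2], [5.0], [3.0], [2.0], [2.0]])
y = 2.0 * X[:, 0] + 1.0
labels = np.array([0, 0, 0, 1, 2, 2])  # stand-in for a linkage estimate

reps = minimax_prototypes(X, labels)
coef = fit_ols(X[reps], y[reps])
```

The sketch conditions on a single linkage estimate, so it ignores linkage uncertainty. Studying how linkage error propagates into the regression would require repeating the last two steps across posterior samples of the linkage rather than one fixed labeling.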


research
10/11/2018

Generalized Bayesian Record Linkage and Regression with Exact Error Propagation

Record linkage (de-duplication or entity resolution) is the process of m...
research
03/08/2017

Performance Bounds for Graphical Record Linkage

Record linkage involves merging records in large, noisy databases to rem...
research
01/08/2023

Bayesian Graphical Entity Resolution Using Exchangeable Random Partition Priors

Entity resolution (record linkage or deduplication) is the process of id...
research
09/01/2020

Invited Discussion of "A Unified Framework for De-Duplication and Population Size Estimation"

Invited Discussion of "A Unified Framework for De-Duplication and Popula...
research
10/08/2021

Multifile Partitioning for Record Linkage and Duplicate Detection

Merging datafiles containing information on overlapping sets of entities...
research
05/14/2019

Scaling Bayesian Probabilistic Record Linkage with Post-Hoc Blocking: An Application to the California Great Registers

Probabilistic record linkage (PRL) is the process of determining which r...
research
09/30/2020

Maximum Entropy classification for record linkage

By record linkage one joins records residing in separate files which are...
