Generalized Bayesian Record Linkage and Regression with Exact Error Propagation

10/11/2018
by Rebecca C. Steorts, et al.

Record linkage (de-duplication or entity resolution) is the process of merging noisy databases to remove duplicate entities. While record linkage removes duplicate entities from such databases, the downstream task is any inferential, predictive, or other post-linkage task performed on the linked data. One goal of the downstream task is to obtain a larger reference data set, which permits more accurate statistical analyses. Uncertainty inherent in the record linkage is also passed on to the downstream task. Motivated by this, we propose a generalized Bayesian record linkage method and consider multiple regression analysis as the downstream task. Records are linked via a random partition model, which allows a wide class of models to be considered. In addition, we model the record linkage and the downstream task jointly, which allows the record linkage uncertainty to be accounted for exactly. The joint model also yields a feedback mechanism that propagates information from the proposed Bayesian record linkage model into the downstream task. This feedback effect is essential for eliminating potential biases that can jeopardize the resulting downstream task. We apply our methodology to multiple linear regression and illustrate empirically that the "feedback effect" can improve the performance of record linkage.
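To make the idea of passing linkage uncertainty into a regression concrete, here is a minimal sketch in Python. It is not the joint model proposed in the paper: it is a simpler two-stage illustration that refits a regression once per sampled partition, so it has no feedback from the regression back into the linkage. The function names, the toy data, the use of a single covariate, the averaging rule for merged records, and the hand-written partition samples (stand-ins for posterior draws from some linkage model) are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def merge_by_partition(x, y, labels):
    """Collapse records that share a cluster label into one entity.

    Averaging the duplicates is an illustrative assumption, not the
    paper's merging rule.
    """
    groups = defaultdict(list)
    for xi, yi, label in zip(x, y, labels):
        groups[label].append((xi, yi))
    merged_x = [mean(p[0] for p in pts) for pts in groups.values()]
    merged_y = [mean(p[1] for p in pts) for pts in groups.values()]
    return merged_x, merged_y


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def propagate_linkage_uncertainty(x, y, partition_samples):
    """Refit the regression under each sampled partition of the records."""
    return [fit_line(*merge_by_partition(x, y, labels))
            for labels in partition_samples]


# Hypothetical noisy records; records 0/1 and 3/4 may be duplicates.
x = [0.0, 0.1, 1.0, 2.0, 2.1, 3.0]
y = [1.1, 0.9, 3.2, 4.8, 5.3, 7.1]

# Hypothetical partition draws: each list assigns a cluster label per record.
partition_samples = [
    [0, 0, 1, 2, 2, 3],  # both suspected pairs linked
    [0, 1, 2, 3, 3, 4],  # only records 3 and 4 linked
    [0, 1, 2, 3, 4, 5],  # no records linked
]

for intercept, slope in propagate_linkage_uncertainty(x, y, partition_samples):
    print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

The spread of the fitted coefficients across partitions gives a rough picture of how linkage uncertainty shows up in the regression. In the joint approach described in the abstract, the regression would additionally inform which records are linked.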

research
10/02/2018

Posterior Prototyping: Bridging the Gap between Bayesian Record Linkage and Regression

Record linkage (entity resolution or de-deduplication) is the process of...
research
03/08/2017

Performance Bounds for Graphical Record Linkage

Record linkage involves merging records in large, noisy databases to rem...
research
01/25/2016

Bayesian Estimation of Bipartite Matchings for Record Linkage

The bipartite record linkage task consists of merging two disparate data...
research
01/08/2023

Bayesian Graphical Entity Resolution Using Exchangeable Random Partition Priors

Entity resolution (record linkage or deduplication) is the process of id...
research
06/01/2023

A General Framework for Regression with Mismatched Data Based on Mixture Modeling

Data sets obtained from linking multiple files are frequently affected b...
research
09/13/2019

d-blink: Distributed End-to-End Bayesian Entity Resolution

Entity resolution (ER) (record linkage or de-duplication) is the process...
research
05/14/2019

Scaling Bayesian Probabilistic Record Linkage with Post-Hoc Blocking: An Application to the California Great Registers

Probabilistic record linkage (PRL) is the process of determining which r...