Scaling Bayesian Probabilistic Record Linkage with Post-Hoc Blocking: An Application to the California Great Registers

05/14/2019
by Brendan S. McVeigh, et al.

Probabilistic record linkage (PRL) is the process of determining which records in two databases correspond to the same underlying entity in the absence of a unique identifier. Bayesian solutions to this problem provide a powerful mechanism for propagating the uncertainty in the links between records through the posterior distribution. However, computational considerations severely limit the practical applicability of existing Bayesian approaches. We propose a new computational approach that provides both a fast algorithm for deriving point estimates of the linkage structure that properly account for one-to-one matching and a restricted MCMC algorithm that samples from an approximate posterior distribution. Our advances make it possible to perform Bayesian PRL for larger problems and to assess the sensitivity of results to varying prior specifications. We demonstrate the methods on a subset of an OCR'd dataset, the California Great Registers, a collection of 57 million voter registrations from 1900 to 1968 that comprises the only panel data set of party registration collected before the advent of scientific surveys.
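
To make the one-to-one matching constraint concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes a small, hypothetical matrix of pairwise match weights (for example, log-likelihood ratios) and compares linking every pair above a threshold with choosing the best assignment in which each record is linked at most once. The names `independent_links` and `best_one_to_one`, the weights, and the threshold are all illustrative assumptions, and the brute-force search is only feasible for toy inputs.

```python
from itertools import permutations


def independent_links(weights, threshold=0.0):
    """Link every pair whose weight exceeds the threshold, ignoring conflicts."""
    return [(i, j)
            for i, row in enumerate(weights)
            for j, w in enumerate(row)
            if w > threshold]


def best_one_to_one(weights, threshold=0.0):
    """Brute-force the highest-scoring linkage in which each record of file A
    and each record of file B appears in at most one link.

    weights[i][j] is a hypothetical match weight for record i of file A and
    record j of file B. A pair is linked only if its weight exceeds the
    threshold; otherwise both records are left unlinked.
    """
    n_a, n_b = len(weights), len(weights[0])
    if n_a > n_b:
        # Search over the smaller file by transposing, then swap indices back.
        transposed = [list(col) for col in zip(*weights)]
        return sorted((i, j) for j, i in best_one_to_one(transposed, threshold))

    best_score, best_links = float("-inf"), []
    # Each permutation assigns a distinct column of B to every row of A.
    for cols in permutations(range(n_b), n_a):
        links = [(i, j) for i, j in enumerate(cols) if weights[i][j] > threshold]
        score = sum(weights[i][j] - threshold for i, j in links)
        if score > best_score:
            best_score, best_links = score, links
    return best_links


# Hypothetical weights: 2 records in file A (rows), 3 records in file B (columns).
example_weights = [
    [5.0, 4.0, -3.0],
    [6.0, -2.0, -4.0],
]

naive = independent_links(example_weights)
matched = best_one_to_one(example_weights)
print("independent thresholding:", naive)
print("one-to-one assignment:  ", matched)
```

In this toy example, independent thresholding links record 0 of file A to two records of file B and links record 0 of file B to two records of file A, which is inconsistent with one-to-one matching. The constrained search resolves the conflict by choosing the jointly best set of links. For realistically sized problems a polynomial-time linear assignment solver would replace the brute-force loop.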

