A Prior for Record Linkage Based on Allelic Partitions

08/23/2020
by Brenda Betancourt, et al.

In database management, record linkage aims to identify multiple records that correspond to the same individual. This task can be treated as a clustering problem, in which a latent entity is associated with one or more noisy database records. However, in contrast to traditional clustering applications, a large number of clusters, each with only a few observations, is expected in this context. In this paper, we introduce a new class of prior distributions based on allelic partitions that is especially suited to the small-cluster setting of record linkage. Our approach makes it straightforward to introduce prior information about the cluster size distribution at different scales, and it naturally enforces sublinear growth of the maximum cluster size, known as the microclustering property. We evaluate the performance of our proposed class of priors using three official statistics data sets and show that our models provide competitive results compared to state-of-the-art microclustering models in the record linkage literature.
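To make the terminology concrete, the following is a minimal sketch, not the authors' model. It assumes the standard definition of an allelic partition as the count of clusters of each size, and it uses a made-up list of entity labels. The function names `allelic_partition` and `max_cluster_fraction` are hypothetical and introduced only for illustration.

```python
from collections import Counter


def allelic_partition(labels):
    """Map each cluster size to the number of clusters of that size.

    `labels` holds one (hypothetical) entity label per database record.
    """
    cluster_sizes = Counter(labels)  # entity label -> number of records
    return dict(Counter(cluster_sizes.values()))  # size -> number of clusters


def max_cluster_fraction(labels):
    """Largest cluster size divided by the number of records.

    Under the microclustering property this ratio should shrink
    toward zero as the number of records grows.
    """
    sizes = Counter(labels).values()
    return max(sizes) / len(labels)


# Illustrative data only: seven records linked to five latent entities.
labels = ["a", "b", "a", "c", "d", "d", "e"]
partition = allelic_partition(labels)
print(partition)
print(max_cluster_fraction(labels))
```

The cluster sizes weighted by their counts always sum to the number of records, which is a quick sanity check on any allelic partition computed this way.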


