Large Scale Record Linkage in the Presence of Missing Data

04/19/2021
by Thilina Ranbaduge, et al.

Record linkage aims to accurately and efficiently identify records that represent the same entity within or across disparate databases. It is a fundamental task in data integration and is increasingly required for accurate decision making in application domains ranging from health analytics to national security. Traditional record linkage techniques calculate string similarities between quasi-identifying (QID) values, such as the names and addresses of people. Errors, variations, and missing QID values can, however, lead to low linkage quality because the similarities between records cannot be calculated accurately. To overcome this challenge, we propose a novel technique that can accurately link records even when QID values contain errors or variations, or are missing. We first generate attribute signatures (concatenated QID values) using an Apriori-based selection of suitable QID attributes, and then generate relational signatures that encapsulate relationship information between records. Combined, these signatures can uniquely identify individual records and facilitate fast, high-quality linking of very large databases through accurate similarity calculations between records. We evaluate the linkage quality and scalability of our approach using large real-world databases, showing that it can achieve high linkage quality even when the databases being linked contain substantial amounts of missing values and errors.
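To make the idea of attribute signatures concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the suitable QID attribute combinations are already known (here they are hand-picked rather than selected with Apriori), and the function names, field names, and sample records are all hypothetical. A record forms a signature only from combinations in which none of its values are missing, so two records can still be paired through one combination even when another is incomplete.

```python
from collections import defaultdict
from itertools import combinations


def attribute_signatures(record, attr_combos):
    """Build one signature per attribute combination whose values are all present.

    A signature is the combination itself plus the normalised, concatenated
    QID values. Combinations with a missing value are skipped.
    """
    signatures = set()
    for combo in attr_combos:
        values = [record.get(attr) for attr in combo]
        if any(v is None or str(v).strip() == "" for v in values):
            continue  # missing QID value: this combination cannot be used
        key = "|".join(str(v).strip().lower() for v in values)
        signatures.add((combo, key))
    return signatures


def candidate_pairs(records, attr_combos):
    """Pair up record ids that share at least one attribute signature."""
    index = defaultdict(set)
    for rid, record in records.items():
        for sig in attribute_signatures(record, attr_combos):
            index[sig].add(rid)
    pairs = set()
    for rids in index.values():
        pairs.update(combinations(sorted(rids), 2))
    return pairs


# Hypothetical sample data: record b1 has no city value.
records = {
    "a1": {"first": "Mary", "last": "Smith", "city": "Leeds", "year": "1980"},
    "b1": {"first": "Mary", "last": "Smith", "city": None, "year": "1980"},
    "b2": {"first": "John", "last": "Smith", "city": "Leeds", "year": "1975"},
}

# Assumed attribute combinations (in the paper these come from an
# Apriori-based selection; here they are simply hard-coded).
combos = [("first", "last", "year"), ("last", "city", "year")]

pairs = candidate_pairs(records, combos)
print(pairs)
```

In this sketch, records a1 and b1 are paired through the (first, last, year) combination even though b1 lacks a city. Exact matching on concatenated values does not by itself tolerate typographical errors, and the relational signatures described in the abstract are not modelled here.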


