Exploiting Redundancy, Recurrence and Parallelism: How to Link Millions of Addresses with Ten Lines of Code in Ten Minutes

08/04/2017
by Yuhang Zhang, et al.

Accurate and efficient record linkage is an open challenge of particular relevance to Australian government agencies, which recognise that so-called wicked social problems are best tackled by forming partnerships founded on large-scale data fusion. Names and addresses are the most common attributes on which data from different government agencies can be linked. In this paper, we focus on the problem of address linking. Linkage is particularly problematic when the data has significant quality issues. The most common approach to dealing with quality issues is to standardise raw data prior to linking. If a mistake is made during standardisation, however, it is usually impossible to recover from it and perform the linkage correctly. This paper proposes a novel algorithm for address linking that is particularly practical for linking large, disparate sets of addresses: it is highly scalable, robust to data quality issues and simple to implement. It obviates the need for labour-intensive and error-prone address standardisation. We demonstrate the efficacy of the algorithm by matching two large address datasets from two government agencies with good accuracy and computational efficiency.

research
02/21/2020

A Joint Bayesian Framework for Causal Inference and Bipartite Matching for Record Linkage

The recent proliferation in the use of digital health data has opened po...
research
01/15/2019

Assessing the accuracy of record linkages with Markov chain based Monte Carlo simulation approach

Record linkage is the process of finding matches and linking records fro...
research
03/12/2020

Improved assessment of the accuracy of record linkage via an extended MaCSim approach

Record linkage is the process of bringing together the same entity from ...
research
08/14/2020

Challenges of Linking Organizational Information in Open Government Data to Knowledge Graphs

Open Government Data (OGD) is being published by various public administ...
research
03/09/2020

Fast Bayesian Record Linkage With Record-Specific Disagreement Parameters

Applied researchers are often interested in linking individuals between ...
research
07/15/2019

Confidentiality and linked data

Data providers such as government statistical agencies perform a balanci...
research
02/16/2021

VIEW: a framework for organization level interactive record linkage to support reproducible data science

Objective: To design and evaluate a general framework for interactive re...
