An Ensemble Blocking Scheme for Entity Resolution of Large and Sparse Datasets

09/20/2016
by   Janani Balaji, et al.

Entity Resolution, also called record linkage or deduplication, is the process of identifying duplicate versions of the same entity and merging them into a unified representation. The standard practice is to use a rule-based or machine-learning-based model that compares entity pairs and assigns each pair a score representing its Match/Non-Match status. However, exhaustively comparing all pairs of records makes the cost of matching quadratic in the number of records. A Blocking step is therefore performed before Matching to group similar entities into smaller blocks that the matcher can then examine exhaustively. Several blocking schemes have been developed to partition the input dataset into manageable groups efficiently and effectively. At CareerBuilder (CB), we perform deduplication on massive datasets of people profiles collected from disparate sources with varying informational content. We observed that no single blocking technique covered all possible scenarios, because of the multi-faceted nature of our data sources. In this paper, we describe our ensemble approach to blocking, which combines two different blocking techniques to leverage their respective strengths.
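To make the idea of blocking concrete, here is a minimal sketch in Python. It is not the paper's method: the abstract does not name the two techniques being combined, so the two key functions (`name_key`, `email_key`), the helper names, and the sample records are all hypothetical, and taking the union of candidate pairs is just one simple way an ensemble might be formed.

```python
from collections import defaultdict
from itertools import combinations


def block_by(records, key_fn):
    """Group record ids by a blocking key; records with no key are skipped."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        key = key_fn(rec)
        if key:
            blocks[key].append(rec_id)
    return blocks


def candidate_pairs(blocks):
    """Enumerate all pairs inside each block (the pairs a matcher would score)."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical blocking keys, chosen only for illustration.
def name_key(rec):
    name = rec.get("name")
    return name[:3].lower() if name else None


def email_key(rec):
    email = rec.get("email")
    return email.lower() if email else None


# Hypothetical sparse profiles: some fields are missing or abbreviated.
records = {
    1: {"name": "Jonathan Smith", "email": "jsmith@example.com"},
    2: {"name": "Jon Smith", "email": None},
    3: {"name": "J. Smith", "email": "jsmith@example.com"},
    4: {"name": "Maria Lopez", "email": "mlopez@example.com"},
}

name_pairs = candidate_pairs(block_by(records, name_key))
email_pairs = candidate_pairs(block_by(records, email_key))

# Assumed ensemble rule: a pair is a candidate if either scheme proposes it.
ensemble_pairs = name_pairs | email_pairs

all_pairs = len(records) * (len(records) - 1) // 2
print(f"exhaustive comparisons: {all_pairs}")
print(f"ensemble candidates:    {sorted(ensemble_pairs)}")
```

In this toy data, the name-prefix key links records 1 and 2 but misses record 3, while the email key links records 1 and 3 but cannot place record 2, which has no email. Each scheme alone misses a plausible duplicate; the union recovers both while still scoring far fewer pairs than the exhaustive comparison.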


research
12/07/2019

AutoBlock: A Hands-off Blocking Framework for Entity Matching

Entity matching seeks to identify data records over one or multiple data...
research
04/19/2022

Generalized Supervised Meta-blocking (technical report)

Entity Resolution constitutes a core data integration task that relies o...
research
11/20/2019

Multi-Source Spatial Entity Linkage

Besides the traditional cartographic data sources, spatial information c...
research
05/31/2018

Skyblocking for Entity Resolution

In this paper, for the first time, we introduce the concept of skyblocki...
research
02/07/2016

ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution

Entity resolution (ER), an important and common data cleaning problem, i...
research
05/28/2020

Efficient and Effective ER with Progressive Blocking

Blocking is a mechanism to improve the efficiency of Entity Resolution (...
research
05/18/2020

A Bayesian Multi-Layered Record Linkage Procedure to Analyze Functional Status of Medicare Patients with Traumatic Brain Injury

Understanding the association between injury severity and patients' pote...
