Efficient and Effective ER with Progressive Blocking

05/28/2020
by Sainyam Galhotra, et al.
Blocking is a mechanism to improve the efficiency of Entity Resolution (ER) by quickly pruning non-matching record pairs. However, depending on the distribution of entity cluster sizes, existing techniques can be either (a) too aggressive, such that they help scale but can adversely affect ER effectiveness, or (b) too permissive, potentially harming ER efficiency. In this paper, we propose a new methodology of progressive blocking (pBlocking) to enable both efficient and effective ER, which works seamlessly across different entity cluster size distributions. pBlocking is based on the insight that the effectiveness-efficiency trade-off is revealed only when the output of ER starts to be available. Hence, pBlocking leverages partial ER output in a feedback loop to refine the blocking result in a data-driven fashion. Specifically, we bootstrap pBlocking with traditional blocking methods and progressively improve the building and scoring of blocks until we get the desired trade-off, using a limited amount of ER results as guidance at every round. We formally prove that pBlocking converges efficiently (O(n log^2 n) time complexity, where n is the total number of records). Our experiments show that incorporating partial ER output in a feedback loop can improve the efficiency and effectiveness of blocking by 5x and 60% respectively, and improve the overall F-score of the entire ER process by up to 60%.
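To make the feedback loop in the abstract concrete, here is a minimal Python sketch. It is not the authors' pBlocking algorithm and does not reproduce its O(n log^2 n) guarantee. It only illustrates the general idea of bootstrapping with a traditional method (token blocking here) and then rescoring blocks from a limited amount of ER output per round. The names `token_blocks`, `jaccard_match`, and `progressive_blocking`, the `rounds` and `budget` parameters, the inverse-size bootstrap score, and the Jaccard-based matcher are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def token_blocks(records):
    """Standard token blocking: one block per token, holding record ids."""
    blocks = defaultdict(set)
    for rid, text in enumerate(records):
        for tok in set(text.lower().split()):
            blocks[tok].add(rid)
    # Blocks with a single record generate no candidate pairs.
    return {k: v for k, v in blocks.items() if len(v) > 1}


def jaccard_match(a, b, threshold=0.5):
    """Stand-in matcher (assumption): token-set Jaccard similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) >= threshold


def progressive_blocking(records, rounds=3, budget=4, match=jaccard_match):
    blocks = token_blocks(records)
    # Bootstrap score (assumption): smaller blocks are more discriminative.
    score = {k: 1.0 / len(v) for k, v in blocks.items()}
    seen = {}  # (i, j) -> match decision already obtained from the matcher

    for _ in range(rounds):
        # Score each unresolved pair by the summed scores of its shared blocks.
        pair_score = defaultdict(float)
        for k, members in blocks.items():
            for pair in combinations(sorted(members), 2):
                if pair not in seen:
                    pair_score[pair] += score[k]
        if not pair_score:
            break
        # Spend the per-round budget on the highest-scoring pairs.
        top = sorted(pair_score, key=lambda p: (-pair_score[p], p))[:budget]
        for i, j in top:
            seen[(i, j)] = match(records[i], records[j])
        # Feedback: rescore each block by the match rate of its resolved pairs.
        for k, members in blocks.items():
            resolved = [seen[p] for p in combinations(sorted(members), 2)
                        if p in seen]
            if resolved:
                score[k] = sum(resolved) / len(resolved)

    matches = sorted(p for p, is_match in seen.items() if is_match)
    return matches, score


records = [
    "apple iphone 12 black",
    "iphone 12 apple black 64gb",
    "samsung galaxy s20 black",
    "galaxy s20 samsung",
    "dell laptop black",
]
matches, score = progressive_blocking(records)
print(matches)
print(score["black"], score["apple"])
```

In this toy run, the common token "black" forms a large block whose resolved pairs rarely match, so its score drops after the first round, while small blocks such as "apple" keep a high score. That is the sense in which partial ER output can guide block scoring in a data-driven way.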

research
12/07/2019

AutoBlock: A Hands-off Blocking Framework for Entity Matching

Entity matching seeks to identify data records over one or multiple data...
research
09/20/2016

An Ensemble Blocking Scheme for Entity Resolution of Large and Sparse Datasets

Entity Resolution, also called record linkage or deduplication, refers t...
research
05/31/2018

Skyblocking for Entity Resolution

In this paper, for the first time, we introduce the concept of skyblocki...
research
09/27/2017

Scaling Author Name Disambiguation with CNF Blocking

An author name disambiguation (AND) algorithm identifies a unique author...
research
09/13/2019

d-blink: Distributed End-to-End Bayesian Entity Resolution

Entity resolution (ER) (record linkage or de-duplication) is the process...
research
06/01/2023

Enumerating Disjoint Partial Models without Blocking Clauses

A basic algorithm for enumerating disjoint propositional models (disjoin...
research
08/19/2020

Scalable Blocking for Very Large Databases

In the field of database deduplication, the goal is to find approximatel...