How to reduce the search space of Entity Resolution: with Blocking or Nearest Neighbor search?

02/25/2022
by George Papadakis, et al.

Entity Resolution suffers from quadratic time complexity. To increase its time efficiency, three kinds of filtering techniques are typically used for restricting its search space: (i) blocking workflows, which group together entity profiles with identical or similar signatures, (ii) string similarity join algorithms, which quickly detect entities more similar than a threshold, and (iii) nearest-neighbor methods, which convert every entity profile into a vector and quickly detect the closest entities according to the specified distance function. Numerous methods have been proposed for each type, but the literature lacks a comparative analysis of their relative performance. As we show in this work, this is a non-trivial task, due to the significant impact of configuration parameters on the performance of each filtering technique. We perform the first systematic experimental study that investigates the relative performance of the main methods per type over 10 real-world datasets. For each method, we consider a plethora of parameter configurations, optimizing it with respect to recall and precision. For each dataset, we consider both schema-agnostic and schema-based settings. The experimental results provide novel insights into the effectiveness and time efficiency of the considered techniques, demonstrating the superiority of blocking workflows and string similarity joins.
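To make the idea of restricting the search space concrete, here is a minimal sketch of one simple form of blocking: schema-agnostic token blocking, where every token in any attribute value acts as a signature and only profiles sharing a token become candidate pairs. It is an illustration under stated assumptions, not the workflow evaluated in the paper: the function names (`tokens`, `token_blocking`), the toy profiles, and the ground truth are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tokens(profile):
    """Schema-agnostic signatures: all lowercase tokens from all attribute values."""
    return {tok for value in profile.values() for tok in value.lower().split()}


def token_blocking(profiles):
    """Place profiles that share a token in the same block; return candidate pairs."""
    blocks = defaultdict(set)
    for pid, profile in profiles.items():
        for tok in tokens(profile):
            blocks[tok].add(pid)
    candidates = set()
    for ids in blocks.values():
        # Only pairs co-occurring in at least one block are ever compared.
        candidates.update(combinations(sorted(ids), 2))
    return candidates


# Hypothetical toy data: attribute names differ across profiles on purpose.
profiles = {
    1: {"name": "Apple iPhone 13", "maker": "Apple"},
    2: {"title": "iphone 13 phone"},
    3: {"name": "Samsung Galaxy S21"},
    4: {"title": "galaxy s21 phone", "brand": "samsung"},
}
ground_truth = {(1, 2), (3, 4)}  # assumed true matches

all_pairs = set(combinations(sorted(profiles), 2))  # the quadratic search space
candidates = token_blocking(profiles)

# Recall: share of true matches retained; precision: share of candidates that match.
recall = len(candidates & ground_truth) / len(ground_truth)
precision = len(candidates & ground_truth) / len(candidates)

print(f"{len(candidates)} of {len(all_pairs)} pairs kept, "
      f"recall={recall:.2f}, precision={precision:.2f}")
```

On this toy input the common token "phone" produces one non-matching candidate, which shows the trade-off the abstract refers to: a filter is judged both by how many true matches it keeps (recall) and by how few superfluous pairs it lets through (precision).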


