A Critical Re-evaluation of Benchmark Datasets for (Deep) Learning-Based Matching Algorithms

07/03/2023
by George Papadakis, et al.

Entity resolution (ER) is the process of identifying records that refer to the same entity within one database or across multiple databases. Numerous techniques have been developed to tackle ER over the years, with recent emphasis on machine learning and deep learning methods for the matching phase. However, the quality of the benchmark datasets typically used in the experimental evaluation of learning-based matching algorithms has not been examined in the literature. To fill this gap, we propose four approaches to assessing the difficulty and appropriateness of 13 established datasets. Two are theoretical: new measures of linearity and existing measures of complexity. Two are practical: the difference between the best non-linear and the best linear matcher, and the difference between the best learning-based matcher and the perfect oracle. Our analysis demonstrates that most of the popular datasets pose rather easy classification tasks and are therefore not suitable for properly evaluating learning-based matching algorithms. To address this issue, we propose a new methodology for constructing benchmark datasets. We put it into practice by creating four new matching tasks, and we verify that these new benchmarks are more challenging and therefore better suited to driving further advances in the field.
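The two practical measures named in the abstract can be illustrated with a minimal sketch. It assumes that matcher quality is summarized by F1 and that the perfect oracle scores F1 = 1.0; the abstract states neither, and the function names `f1_score` and `practical_measures`, as well as the result keys, are hypothetical and not taken from the paper.

```python
def f1_score(predicted, actual):
    """F1 over boolean match labels (True = the record pair is a match)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def practical_measures(linear_f1, nonlinear_f1):
    """Compute two gap measures from per-matcher F1 scores.

    linear_f1 / nonlinear_f1: dicts mapping a matcher name to its F1
    on one dataset. Assumption: the perfect oracle has F1 = 1.0.
    """
    best_linear = max(linear_f1.values())
    best_nonlinear = max(nonlinear_f1.values())
    best_learned = max(best_linear, best_nonlinear)
    return {
        # Gain of the best non-linear matcher over the best linear one.
        "nonlinear_boost": best_nonlinear - best_linear,
        # Headroom left between the best learned matcher and the oracle.
        "learning_margin": 1.0 - best_learned,
    }
```

Under these assumptions, a dataset on which both values are close to zero leaves little room to distinguish matchers: a linear model already performs nearly as well as anything else, and nearly as well as a perfect matcher.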


