Alaska: A Flexible Benchmark for Data Integration Tasks

01/27/2021
by Valter Crescenzi, et al.

Data integration is a long-standing interest of the data management community and has many disparate applications, including business, science and government. We have recently witnessed impressive results in specific data integration tasks, such as Entity Resolution, thanks to the increasing availability of benchmarks. A limitation of such benchmarks is that they typically come with their own task definition, and it can be difficult to leverage them for complex integration pipelines. As a result, evaluating end-to-end pipelines for the entire data integration process is still an elusive goal. In this work, we present Alaska, the first benchmark based on a real-world dataset to seamlessly support multiple tasks (and their variants) of the data integration pipeline. The dataset consists of about 70k heterogeneous product specifications from 71 e-commerce websites with thousands of different product attributes. Our benchmark comes with profiling metadata, a set of pre-defined use cases with diverse characteristics, and an extensive manually curated ground truth. We demonstrate the flexibility of our benchmark by focusing on several variants of two crucial data integration tasks, Schema Matching and Entity Resolution. Our experiments show that our benchmark enables the evaluation of a variety of methods that were previously difficult to compare, and can foster the design of more holistic data integration solutions.

