Measuring and Predicting the Quality of a Join for Data Discovery

05/31/2023
by Sergi Nadal, et al.

We study the problem of discovering joinable datasets at scale. We approach the problem from a learning perspective relying on profiles. These are succinct representations that capture the underlying characteristics of the schemata and data values of datasets, and they can be efficiently extracted in a distributed and parallel fashion. Profiles are then compared to predict the quality of a join operation between a pair of attributes from different datasets. In contrast to the state of the art, we define a novel notion of join quality that relies on a metric considering both the containment and the cardinality proportion between join candidate attributes. We implement our approach in a system called NextiaJD, and present experiments to show the predictive performance and computational efficiency of our method. Our experiments show that NextiaJD obtains higher predictive performance than hash-based methods while scaling up to larger volumes of data.
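
To make the two ingredients of the join quality notion concrete, the following is a minimal sketch that computes them exactly on small in-memory columns. It is not NextiaJD's implementation: the function names, the use of distinct values, and the min/max form of the cardinality proportion are assumptions made for illustration, and the abstract does not state how the two quantities are combined into a quality score, so that step is left out.

```python
def containment(a, b):
    """Fraction of the distinct values of `a` that also appear in `b`.

    Assumed definition: |A intersect B| / |A| over distinct values.
    """
    sa, sb = set(a), set(b)
    if not sa:
        return 0.0
    return len(sa & sb) / len(sa)


def cardinality_proportion(a, b):
    """Ratio between the smaller and the larger distinct-value count.

    Assumed (symmetric) definition, chosen only for illustration.
    """
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return min(len(sa), len(sb)) / max(len(sa), len(sb))


# Hypothetical join candidate attributes from two datasets.
countries = ["ES", "FR", "DE", "IT"]
codes = ["ES", "FR", "DE", "IT", "PT", "NL", "BE", "AT"]

print(containment(countries, codes))             # every value of `countries` is in `codes`
print(cardinality_proportion(countries, codes))  # `countries` has half as many distinct values
```

In this example the first attribute is fully contained in the second, yet the second has twice as many distinct values. A metric based on containment alone would not distinguish this pair from one with equal cardinalities, which is why the cardinality proportion is considered as well.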


