Cost-efficient Data Acquisition on Online Data Marketplaces for Correlation Analysis

08/28/2018
by Yanying Li, et al.

Driven by the prospect of large economic returns, data marketplace platforms have proliferated in recent years. In this paper, we consider a data marketplace setting in which a data shopper wants to buy data instances for correlation analysis of certain attributes. We assume that the data in the marketplace is dirty and not free. The goal is to find, among a large number of datasets in the marketplace, the data instances whose join result is of high quality, has rich join informativeness, and delivers the best correlation between the requested attributes. To achieve this goal, we design DANCE, a middleware that provides the desired data acquisition service. DANCE consists of two phases: (1) in the offline phase, it constructs a two-layer join graph from samples, which captures information about the datasets in the marketplace at both the schema and instance levels; (2) in the online phase, it searches for the data instances that satisfy the constraints on data quality, budget, and join informativeness while maximizing the correlation between the source and target attribute sets. We prove that the search problem is NP-hard and design a heuristic algorithm based on Markov chain Monte Carlo (MCMC). Experimental results on two benchmark datasets demonstrate the efficiency and effectiveness of our heuristic data acquisition algorithm.
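
The abstract does not spell out the search procedure, so the following is only a minimal sketch of a generic Metropolis-style MCMC search over subsets of datasets, not the DANCE algorithm itself. The names `mcmc_select`, `score`, `feasible`, and `temperature`, as well as the toy prices and values, are assumptions made for illustration: `score` stands in for the correlation objective, and `feasible` stands in for the quality, budget, and join-informativeness constraints.

```python
import math
import random


def mcmc_select(n, score, feasible, steps=2000, temperature=0.1, seed=0):
    """Search subsets of {0, ..., n-1} for a feasible subset with a high score.

    score(subset) and feasible(subset) are caller-supplied (hypothetical)
    callables; the empty subset is assumed to be feasible.
    """
    rng = random.Random(seed)
    current = frozenset()
    current_score = score(current)
    best, best_score = current, current_score

    for _ in range(steps):
        # Proposal: toggle the membership of one randomly chosen dataset.
        i = rng.randrange(n)
        candidate = current ^ {i}

        # Proposals that violate any constraint are rejected outright.
        if not feasible(candidate):
            continue

        # Metropolis rule: always accept an improvement; accept a worse
        # state with probability exp(delta / temperature).
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score

    return best, best_score


# Toy usage with made-up prices and per-dataset values (illustrative only).
prices = [4, 3, 2, 5]
values = [0.5, 0.4, 0.3, 0.6]
budget = 7


def toy_score(subset):
    return sum(values[j] for j in subset)


def within_budget(subset):
    return sum(prices[j] for j in subset) <= budget


best, best_score = mcmc_select(len(prices), toy_score, within_budget)
print(sorted(best), best_score)
```

Occasionally accepting a worse state lets the chain leave local optima, which is the usual motivation for sampling-based heuristics on NP-hard selection problems; the best feasible state seen so far is tracked separately so that these moves never degrade the returned answer.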

research
12/15/2020

Instance Optimal Join Size Estimation

We consider the problem of efficiently estimating the size of the inner ...
research
05/15/2019

Improving Distributed Similarity Join in Metric Space with Error-bounded Sampling

Given two sets of objects, metric similarity join finds all similar pair...
research
07/31/2023

ADOPT: Adaptively Optimizing Attribute Orders for Worst-Case Optimal Join Algorithms via Reinforcement Learning

The performance of worst-case optimal join algorithms depends on the ord...
research
05/31/2023

Measuring and Predicting the Quality of a Join for Data Discovery

We study the problem of discovering joinable datasets at scale. We appro...
research
08/23/2019

Efficient Join Processing Over Incomplete Data Streams (Technical Report)

For decades, the join operator over fast data streams has always drawn m...
research
10/15/2019

Optimizing Semi-Stream CACHEJOIN for Near-Real-Time Data Warehousing

Streaming data join is a critical process in the field of near-real-time...
research
03/26/2021

Synthesizing Linked Data Under Cardinality and Integrity Constraints

The generation of synthetic data is useful in multiple aspects, from tes...