Dataset Discovery in Data Lakes

11/20/2020
by Alex Bogatu, et al.

Data analytics stands to benefit from the increasing availability of datasets that are held without their conceptual relationships being explicitly known. When collected, these datasets form a data lake from which, by processes like data wrangling, specific target datasets can be constructed that enable value-adding analytics. Given the potential vastness of such data lakes, the issue arises of how to pull out of the lake those datasets that might contribute to wrangling out a given target. We refer to this as the problem of dataset discovery in data lakes and this paper contributes an effective and efficient solution to it. Our approach uses features of the values in a dataset to construct hash-based indexes that map those features into a uniform distance space. This makes it possible to define similarity distances between features and to take those distances as measurements of relatedness w.r.t. a target table. Given the latter (and exemplar tuples), our approach returns the most related tables in the lake. We provide a detailed description of the approach and report on empirical results for two forms of relatedness (unionability and joinability) comparing them with prior work, where pertinent, and showing significant improvements in all of precision, recall, target coverage, indexing and discovery times.
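
To make the idea of hash-based features in a uniform distance space concrete, the following is a minimal sketch, not the authors' implementation. It assumes MinHash signatures over the distinct values of a single column and uses estimated Jaccard distance as the relatedness measure; the abstract does not state which hashing scheme or features the paper uses. The function names (minhash_signature, estimated_distance, rank_tables) and the toy lake are hypothetical.

```python
import hashlib


def minhash_signature(values, num_hashes=64):
    """Return a MinHash signature (list of ints) for a collection of strings.

    Assumption: a column is summarised only by its set of distinct values.
    """
    distinct = set(values)
    if not distinct:
        raise ValueError("cannot build a signature for an empty column")
    signature = []
    for seed in range(num_hashes):
        # A different salt per seed simulates an independent hash function.
        salt = seed.to_bytes(8, "big")
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(
                        v.encode("utf-8"), digest_size=8, salt=salt
                    ).digest(),
                    "big",
                )
                for v in distinct
            )
        )
    return signature


def estimated_distance(sig_a, sig_b):
    """Estimate Jaccard distance as the fraction of differing signature slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return 1.0 - matches / len(sig_a)


def rank_tables(target_column, lake, k=2):
    """Return the names of the k lake columns closest to the target column.

    `lake` is a hypothetical mapping from table name to one column of values.
    """
    target_sig = minhash_signature(target_column)
    scored = [
        (estimated_distance(target_sig, minhash_signature(column)), name)
        for name, column in lake.items()
    ]
    return [name for _, name in sorted(scored)[:k]]


if __name__ == "__main__":
    lake = {
        "cities": ["london", "paris", "rome"],
        "codes": ["x1", "y2", "z3"],
    }
    print(rank_tables(["paris", "rome", "london"], lake, k=1))
```

Because every column is reduced to a fixed-length signature, distances between any two columns are comparable on the same scale, which is the property the abstract relies on when ranking lake tables against a target. A real system would index the signatures (for example with locality-sensitive hashing) rather than scan the whole lake as this sketch does.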

research
10/01/2021

MATE: Multi-Attribute Table Extraction

A core operation in data discovery is to find joinable tables for a give...
research
01/12/2023

Pylon: Semantic Table Union Search in Data Lakes

The large size and fast growth of data repositories, such as data lakes,...
research
12/29/2022

WarpGate: A Semantic Join Discovery System for Cloud Data Warehouses

Data discovery is a major challenge in enterprise data analysis: users o...
research
10/04/2022

Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning

Dataset discovery from data lakes is essential in many real application ...
research
02/02/2023

Tab2KG: Semantic Table Interpretation with Lightweight Semantic Profiles

Tabular data plays an essential role in many data analytics and machine ...
research
12/17/2018

Optimizing Organizations for Navigating Data Lakes

Navigation is known to be an effective complement to search. In addition...
research
03/21/2022

IoT Data Discovery: Routing Table and Summarization Techniques

In this paper, we consider the IoT data discovery problem in very large ...
