Pylon: Semantic Table Union Search in Data Lakes

01/12/2023
by Tianji Cong, et al.

The large size and fast growth of data repositories, such as data lakes, have spurred the need for data discovery to help analysts find related data. The problem is challenging because (i) a user typically does not know what datasets exist in an enormous data repository; and (ii) there is usually no unified data model that captures the interrelationships between heterogeneous datasets from disparate sources. In this work, we address one important class of discovery needs: finding union-able tables. The task is to find tables in a data lake that can be unioned with a given query table. The challenge is to recognize union-able columns even when they are represented differently. We propose a data-driven learning approach that casts the problem as an unsupervised representation learning and embedding retrieval task. Our key idea is to exploit self-supervised contrastive learning to train an embedding model that takes into account the indexing/search data structure, placing columns with semantically similar values close together in embedding space while pushing apart columns with semantically dissimilar values. We then find union-able tables based on similarities between their constituent columns in embedding space. On a real-world data lake, we demonstrate that our best-performing model achieves significant improvements in precision (16%↑), recall (17%↑), and query response time (7x faster) compared to the state-of-the-art.
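The retrieval step described in the abstract, scoring candidate tables by the similarity of their columns in embedding space, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not the paper's implementation: `embed_column` is a hashed bag-of-tokens stand-in for the learned contrastive encoder, `table_score` uses a simple "best match per query column, then average" aggregation, and the names `DIM`, `rank_tables`, and the toy tables are hypothetical.

```python
import math
import zlib

DIM = 4096  # embedding width; arbitrary choice for this sketch


def embed_column(values):
    """Toy stand-in for a learned column encoder (assumption, not the paper's model).

    Builds a hashed bag-of-tokens vector and L2-normalizes it.
    """
    vec = [0.0] * DIM
    for value in values:
        for token in str(value).lower().split():
            # crc32 gives a hash that is stable across processes.
            vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def cosine(u, v):
    # Inputs are unit-length (or all-zero), so the dot product is the cosine.
    return sum(a * b for a, b in zip(u, v))


def table_score(query_table, candidate_table):
    """Average, over query columns, of the best-matching candidate column.

    This aggregation rule is an illustrative assumption.
    """
    q_embs = [embed_column(col) for col in query_table.values()]
    c_embs = [embed_column(col) for col in candidate_table.values()]
    if not q_embs or not c_embs:
        return 0.0
    best = [max(cosine(q, c) for c in c_embs) for q in q_embs]
    return sum(best) / len(best)


def rank_tables(query_table, lake):
    """Return the lake's table names sorted by descending score."""
    return sorted(
        lake,
        key=lambda name: table_score(query_table, lake[name]),
        reverse=True,
    )


# Hypothetical toy data: column headers differ, values overlap.
query = {"city": ["Ann Arbor", "Detroit"], "state": ["Michigan", "Michigan"]}
lake = {
    "places": {"town": ["Detroit", "Lansing"], "region": ["Michigan", "Ohio"]},
    "parts": {"sku": ["A-100", "B-200"], "price": ["9.99", "19.50"]},
}
ranking = rank_tables(query, lake)
```

In this sketch the scan over the lake is exhaustive. The abstract's emphasis on the indexing/search data structure suggests that a real system would retrieve nearest-neighbor column embeddings through an index rather than comparing against every table.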


research
12/29/2022

WarpGate: A Semantic Join Discovery System for Cloud Data Warehouses

Data discovery is a major challenge in enterprise data analysis: users o...
research
10/04/2022

Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning

Dataset discovery from data lakes is essential in many real application ...
research
11/20/2020

Dataset Discovery in Data Lakes

Data analytics stands to benefit from the increasing availability of dat...
research
08/07/2023

Generative Benchmark Creation for Table Union Search

Data management has traditionally relied on synthetic data generators to...
research
09/27/2022

SANTOS: Relationship-based Semantic Table Union Search

Existing techniques for unionable table search define unionability using...
research
12/15/2022

DeepJoin: Joinable Table Discovery with Pre-trained Language Models

Due to the usefulness in data enrichment for data analysis tasks, joinab...
research
03/12/2019

Termite: A System for Tunneling Through Heterogeneous Data

Data-driven analysis is important in virtually every modern organization...
