Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning

10/04/2022
by Grace Fan, et al.
Dataset discovery from data lakes is essential in many real application scenarios. In this paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes, with table union search as the main use case. The framework features a contrastive learning method that trains column encoders from pre-trained language models in a fully unsupervised manner. The column encoder of Starmie captures the rich contextual semantic information within tables by leveraging a contrastive multi-column pre-training strategy. We use the cosine similarity between column embedding vectors as the column unionability score, and we propose a filter-and-verification framework that allows exploring a variety of design choices for computing the unionability score between two tables. Empirical evaluation results on real table benchmark datasets show that Starmie outperforms the best-known solutions in the effectiveness of table union search by 6.8 in MAP and recall. Moreover, Starmie is the first to employ the HNSW (Hierarchical Navigable Small World) index to accelerate query processing of table union search, which provides a 3,000X performance gain over the linear scan baseline and a 400X performance gain over an LSH index (the state-of-the-art solution for data lake indexing).
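
As a rough illustration of the scoring idea in the abstract, the sketch below computes cosine similarity between column embedding vectors and aggregates the column scores into a table-level score. It is a minimal sketch under stated assumptions, not Starmie's implementation: the embeddings are plain lists of floats standing in for encoder output, and `greedy_table_score` is a hypothetical aggregation (greedy one-to-one column matching, normalized by the number of query columns) chosen for brevity. The abstract only says that several design choices for the table-level score are possible.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_table_score(query_cols, cand_cols):
    """Hypothetical table-level score (an assumption, not the paper's definition).

    Greedily pairs query and candidate columns one-to-one in descending order
    of cosine similarity, sums the matched scores, and divides by the number
    of query columns.
    """
    if not query_cols:
        return 0.0
    # Score every (query column, candidate column) pair, best pairs first.
    pairs = sorted(
        (
            (cosine(q, c), i, j)
            for i, q in enumerate(query_cols)
            for j, c in enumerate(cand_cols)
        ),
        reverse=True,
    )
    used_q, used_c, total = set(), set(), 0.0
    for score, i, j in pairs:
        # Skip columns that are already matched and pairs with no positive similarity.
        if i in used_q or j in used_c or score <= 0.0:
            continue
        used_q.add(i)
        used_c.add(j)
        total += score
    return total / len(query_cols)
```

A greedy matching is cheap but can miss the best overall assignment; an exact maximum-weight bipartite matching is a stricter alternative at higher cost, which is the kind of trade-off a filter-and-verification design is meant to expose.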

research
12/15/2022

DeepJoin: Joinable Table Discovery with Pre-trained Language Models

Due to the usefulness in data enrichment for data analysis tasks, joinab...
research
01/12/2023

Pylon: Semantic Table Union Search in Data Lakes

The large size and fast growth of data repositories, such as data lakes,...
research
09/27/2022

SANTOS: Relationship-based Semantic Table Union Search

Existing techniques for unionable table search define unionability using...
research
05/03/2023

Pre-train and Search: Efficient Embedding Table Sharding with Pre-trained Neural Cost Models

Sharding a large machine learning model across multiple devices to balan...
research
11/20/2020

Dataset Discovery in Data Lakes

Data analytics stands to benefit from the increasing availability of dat...
research
09/11/2021

Making Table Understanding Work in Practice

Understanding the semantics of tables at scale is crucial for tasks like...
research
08/07/2023

Generative Benchmark Creation for Table Union Search

Data management has traditionally relied on synthetic data generators to...