Efficiently Transforming Tables for Joinability

11/18/2021
by Arash Dargahi Nobari, et al.

Data from different sources rarely conform to a single format, even when they describe the same set of entities, and this raises concerns when data from multiple sources must be joined or cross-referenced. Such formatting mismatches are unavoidable when data is gathered from various public and third-party sources. Commercial database systems cannot perform the join when the data differ in representation or formatting, and manual reformatting is both time-consuming and error-prone. We study the problem of efficiently joining textual data when the join columns are not formatted the same way and cannot be equi-joined, but become joinable under some transformations. The problem is challenging because the number of possible transformations explodes with both the length of the input and the number of rows, even if each transformation is formed from very few basic units. We show that an efficient algorithm can be developed based on the common characteristics of the joined columns, and we develop one such algorithm over a rich set of basic operations that can be composed to form transformations. We compare both the coverage and the running time of our algorithm to a state-of-the-art approach, and show that our algorithm covers every transformation covered by the state-of-the-art approach while being a few orders of magnitude faster, as evaluated on various real and synthetic data.
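To make the problem setting concrete, the following is a minimal Python sketch of two columns that cannot be equi-joined but become joinable once a transformation composed of a few basic units is applied. It is an illustration only and not the algorithm from the paper: the unit names (`split_part`, `substring`, `literal`), the helpers `chain`, `concat`, and `transform_join`, and the sample names are all assumptions made for this example. The transformation is also written by hand here, whereas the paper is concerned with finding such transformations efficiently.

```python
from typing import Callable, Dict, List, Tuple

# A basic unit maps an input string to an output string.
# All unit names below are hypothetical, chosen for illustration.
Unit = Callable[[str], str]


def split_part(sep: str, index: int) -> Unit:
    """Split the input on `sep` and return the piece at `index`, stripped."""
    def unit(s: str) -> str:
        parts = s.split(sep)
        return parts[index].strip() if index < len(parts) else ""
    return unit


def substring(start: int, end: int) -> Unit:
    """Return the characters of the input in the range [start, end)."""
    return lambda s: s[start:end]


def literal(text: str) -> Unit:
    """Ignore the input and return a fixed string."""
    return lambda s: text


def chain(*units: Unit) -> Unit:
    """Pipe the input through the units from left to right."""
    def unit(s: str) -> str:
        for u in units:
            s = u(s)
        return s
    return unit


def concat(*units: Unit) -> Unit:
    """Apply every unit to the same input and concatenate the outputs."""
    return lambda s: "".join(u(s) for u in units)


def transform_join(left: List[str], right: List[str],
                   transform: Unit) -> List[Tuple[str, str]]:
    """Equi-join transform(left value) against the right column."""
    index: Dict[str, List[str]] = {}
    for r in right:
        index.setdefault(r, []).append(r)
    return [(l, r) for l in left for r in index.get(transform(l), [])]


# "Smith, John" -> "J" + ". " + "Smith"
to_initial_form = concat(
    chain(split_part(",", 1), substring(0, 1)),
    literal(". "),
    split_part(",", 0),
)

left_column = ["Smith, John", "Doe, Jane"]
right_column = ["J. Doe", "J. Smith", "A. Turing"]
pairs = transform_join(left_column, right_column, to_initial_form)
```

Even in this toy setting, the choice of separators, indices, substring bounds, and literals gives many candidate transformations per row, which hints at why the search space grows so quickly with input length and row count.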


research
03/12/2023

DTT: An Example-Driven Tabular Transformer by Leveraging Large Language Models

Many organizations rely on data from government and third-party sources,...
research
12/11/2020

Discovering Multi-Table Functional Dependencies Without Full Join Computation

In this paper, we study the problem of discovering join FDs, i.e., funct...
research
09/18/2022

Scaling and Load-Balancing Equi-Joins

The task of joining two tables is fundamental for querying databases. In...
research
07/12/2021

In-Database Regression in Input Sparsity Time

Sketching is a powerful dimensionality reduction technique for accelerat...
research
07/27/2017

Approximations and Bounds for (n, k) Fork-Join Queues: A Linear Transformation Approach

Compared to basic fork-join queues, a job in (n, k) fork-join queues onl...
research
06/10/2022

Density-optimized Intersection-free Mapping and Matrix Multiplication for Join-Project Operations (extended version)

A Join-Project operation is a join operation followed by a duplicate eli...
