Scaling and Load-Balancing Equi-Joins

09/18/2022
by Ahmed Metwally, et al.

The task of joining two tables is fundamental for querying databases. In this paper, we focus on the equi-join problem, where a pair of records from the two joined tables is part of the join results if equality holds between their values in the join column(s). While this is a tractable problem when the number of records in the joined tables is relatively small, it becomes very challenging as the table sizes increase, especially if hot keys (join column values with a large number of records) exist in both joined tables. This paper, an extended version of [metwally-SIGMOD-2022], proposes Adaptive-Multistage-Join (AM-Join) for scalable and fast equi-joins in distributed shared-nothing architectures. AM-Join utilizes (a) Tree-Join, a proposed novel algorithm that scales well when the joined tables share hot keys, and (b) Broadcast-Join, the fastest known algorithm for joining keys that are hot in only one table. Unlike the state-of-the-art algorithms, AM-Join (a) holistically solves the join-skew problem by achieving load balancing throughout the join execution, and (b) supports all outer-join variants without record deduplication or custom table partitioning. For the fastest AM-Join outer-join performance, we propose the Index-Broadcast-Join (IB-Join) family of algorithms for Small-Large joins, where one table fits in memory and the other can be up to orders of magnitude larger. The outer-join variants of IB-Join improve on the state-of-the-art Small-Large outer-join algorithms. The proposed algorithms can be adopted in any shared-nothing architecture. We implemented a MapReduce version using Spark. Our evaluation shows that the proposed algorithms execute significantly faster and scale to more skewed and orders-of-magnitude bigger tables when compared to the state-of-the-art algorithms.
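
To make the terms in the abstract concrete, the following is a minimal single-machine sketch of an inner equi-join in the Small-Large setting, together with a simple hot-key count. It is not the AM-Join, Tree-Join, or IB-Join algorithm from the paper: the function names `broadcast_equi_join` and `hot_keys`, the tuple-based record layout, and the `threshold` parameter are all assumptions made for illustration. In a real shared-nothing system the index over the small table would be sent to every worker, and each worker would stream only its own partition of the large table.

```python
from collections import Counter, defaultdict


def broadcast_equi_join(small, large, key_small=0, key_large=0):
    """Inner equi-join of two lists of tuples (illustrative sketch only).

    The small table is indexed in memory by its join key; this index
    stands in for the structure a broadcast-style join would ship to
    every worker. The large table is then streamed past the index.
    """
    index = defaultdict(list)
    for rec in small:
        index[rec[key_small]].append(rec)

    for rec in large:
        # A pair is emitted only when the join-column values are equal.
        for match in index.get(rec[key_large], ()):
            yield match, rec


def hot_keys(table, threshold, key=0):
    """Return join-key values that occur at least `threshold` times.

    `threshold` is a hypothetical tuning knob, not a value from the paper.
    """
    counts = Counter(rec[key] for rec in table)
    return {k for k, c in counts.items() if c >= threshold}


if __name__ == "__main__":
    small = [("a", 1), ("b", 2)]
    large = [("a", "x"), ("a", "y"), ("c", "z")]
    for left, right in broadcast_equi_join(small, large):
        print(left, right)
    print(hot_keys(large, threshold=2))
```

In this sketch, key `"a"` is hot only in the large table, which is the case the abstract associates with Broadcast-Join. When a key is hot in both tables, the number of output pairs for that key is the product of its two record counts, which is the skew scenario the abstract assigns to Tree-Join.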


research
10/01/2021

MATE: Multi-Attribute Table Extraction

A core operation in data discovery is to find joinable tables for a give...
research
03/14/2023

One Size Cannot Fit All: a Self-Adaptive Dispatcher for Skewed Hash Join in Shared-nothing RDBMSs

Shared-nothing architecture has been widely adopted in various commercia...
research
10/26/2020

Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach

Finding joinable tables in data lakes is key procedure in many applicati...
research
11/18/2021

Efficiently Transforming Tables for Joinability

Data from different sources rarely conform to a single formatting even i...
research
01/07/2022

Weighted Random Sampling over Joins

Joining records with all other records that meet a linkage condition can...
research
02/26/2018

Adaptive Geospatial Joins for Modern Hardware

Geospatial joins are a core building block of connected mobility applica...
research
06/10/2022

Density-optimized Intersection-free Mapping and Matrix Multiplication for Join-Project Operations (extended version)

A Join-Project operation is a join operation followed by a duplicate eli...
