One Size Cannot Fit All: a Self-Adaptive Dispatcher for Skewed Hash Join in Shared-nothing RDBMSs

03/14/2023
by Jinxin Yang, et al.

The shared-nothing architecture has been widely adopted in commercial distributed RDBMSs. It allows queries to be processed in parallel and accelerated by scaling the cluster horizontally on demand. Even so, load balancing remains a challenging issue in all distributed RDBMSs, including shared-nothing ones, which suffer considerably from skewed data distribution. In this work, we focus on one representative operator, Hash Join, and investigate how skew among the nodes of a cluster affects the load balance and the eventual efficiency of an arbitrary query in shared-nothing RDBMSs. We found that existing Distributed Hash Join (Dist-HJ) solutions may not provide satisfactory performance when a value is skewed in both the probe and build tables. To address that, we propose a novel Dist-HJ solution, Partition and Replication (PnR). Although PnR provides the best efficiency in some skew scenarios, our exhaustive experiments over a group of shared-nothing RDBMSs show that no single Dist-HJ solution wins in all data-skew scenarios. We therefore further propose a self-adaptive Dist-HJ solution with a built-in sub-operator cost model that dynamically selects the best Dist-HJ implementation strategy at runtime according to the data skew of the target query. We implement the solution in our commercial shared-nothing RDBMS, KaiwuDB (formerly ZNBase), and an empirical study shows that the self-adaptive model achieves the best performance compared with a series of solutions adopted in many existing RDBMSs.
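To make the idea of skew-dependent strategy selection concrete, the following is a minimal sketch and not the authors' implementation. It compares two textbook Dist-HJ strategies (hash-partitioning both tables versus broadcasting the build table) under a deliberately simple, assumed cost model: the number of rows the busiest node must process. The function names, the modulo placement rule, and the cost model itself are hypothetical illustrations; PnR is not modeled because the abstract does not describe its details.

```python
from math import ceil

def max_load_hash_partition(build_keys, probe_keys, n_nodes):
    """Rows on the busiest node when both tables are repartitioned by join key.

    Assumption: integer keys are placed on node (key % n_nodes).
    """
    load = [0] * n_nodes
    for k in build_keys:
        load[k % n_nodes] += 1
    for k in probe_keys:
        load[k % n_nodes] += 1
    return max(load)

def max_load_broadcast_build(build_keys, probe_keys, n_nodes):
    """Rows on the busiest node when the build table is replicated everywhere.

    Assumption: probe rows stay where they are and are spread evenly.
    """
    return len(build_keys) + ceil(len(probe_keys) / n_nodes)

def choose_strategy(build_keys, probe_keys, n_nodes):
    """Pick the strategy with the smallest busiest-node load (toy cost model)."""
    costs = {
        "hash_partition": max_load_hash_partition(build_keys, probe_keys, n_nodes),
        "broadcast_build": max_load_broadcast_build(build_keys, probe_keys, n_nodes),
    }
    best = min(costs, key=costs.get)
    return best, costs

if __name__ == "__main__":
    # Uniform keys: partitioning spreads the work evenly across 4 nodes.
    print(choose_strategy(list(range(1000)), list(range(1000)) * 4, 4))
    # One hot probe key: partitioning sends every probe row to a single node.
    print(choose_strategy(list(range(100)), [7] * 4000, 4))
```

Under these assumptions, the uniform input favors hash partitioning while the input with one hot probe key favors broadcasting the build table, which illustrates why a fixed choice can lose on some inputs. A real dispatcher would also need to account for network transfer, memory limits, and skew in the build table.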

