Processing Database Joins over a Shared-Nothing System of Multicore Machines

04/25/2018
by   Abhirup Chakraborty, et al.

To process a large volume of data, modern data management systems use a collection of machines connected through a network. This paper examines the feasibility of scaling up such a shared-nothing system while processing a compute- and communication-intensive workload: distributed joins. By exploiting multiple processing cores within the individual machines, we implement a join processing system that parallelizes computation within each node, pipelines computation with communication, parallelizes communication by allowing multiple simultaneous data transfers (send/receive), and removes synchronization barriers (a scalability bottleneck in distributed data processing systems). Our experimental results show that, using only four threads per node, the framework achieves a 3.5x gain in intra-node performance compared with a single-threaded counterpart. Moreover, with the join processing workload, the cluster-wide performance (and speedup) is observed to be dictated by the intra-node computational load; this property yields a near-linear speedup as nodes are added to the system, a feature much desired in modern large-scale data processing systems.
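The idea of parallelizing join computation within a node can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a simple hash-partitioned equi-join on the first column of each row, and the names `partition`, `join_partition`, and `parallel_hash_join` are hypothetical. It also leaves out the networking, pipelining, and barrier-free coordination the abstract describes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(rows, num_parts):
    """Hash-partition rows on their first column (the assumed join key)."""
    parts = [[] for _ in range(num_parts)]
    for row in rows:
        parts[hash(row[0]) % num_parts].append(row)
    return parts


def join_partition(left_part, right_part):
    """Build/probe hash join over one pair of co-partitioned buckets."""
    table = defaultdict(list)
    for row in left_part:            # build phase
        table[row[0]].append(row)
    out = []
    for row in right_part:           # probe phase
        for match in table.get(row[0], []):
            out.append(match + row[1:])
    return out


def parallel_hash_join(left, right, num_workers=4):
    """Join two lists of tuples, one worker thread per partition pair."""
    left_parts = partition(left, num_workers)
    right_parts = partition(right, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = pool.map(join_partition, left_parts, right_parts)
        return [row for part in results for row in part]


if __name__ == "__main__":
    left = [(1, "a"), (2, "b"), (2, "c")]
    right = [(2, "x"), (3, "y"), (1, "z")]
    print(sorted(parallel_hash_join(left, right)))
```

Because rows with equal keys always hash to the same partition, each partition pair can be joined independently, with no shared state between workers. In CPython, threads do not speed up CPU-bound work of this kind, so the sketch shows the structure of the decomposition only and says nothing about the speedups reported in the abstract.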


