System-aware dynamic partitioning for batch and streaming workloads

05/31/2021
by Zoltán Zvara, et al.

When processing data streams with highly skewed and nonstationary key distributions, we often observe overloaded partitions because hash partitioning fails to balance the data. To avoid slow tasks that delay the completion of an entire stage of computation, it is necessary to apply adaptive, on-the-fly partitioning that continuously recomputes an optimal partitioner from the observed key distribution. While such solutions exist for batch processing of static data sets and for stateless stream processing, the task is difficult for long-running stateful streaming jobs in which the key distribution changes over time: careful checkpointing and operator state migration are necessary to change the partitioning while the operation is running. Our key result is a lightweight, on-the-fly Dynamic Repartitioning (DR) module for distributed data processing systems (DDPS), including Apache Spark and Flink, which improves performance with negligible overhead. DR adaptively repartitions data during execution using our Key Isolator Partitioner (KIP). In our experiments with real workloads and power-law distributions, we reach a speedup of 1.5x to 6x for a variety of Spark and Flink jobs.
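
The abstract does not describe how KIP works internally, so the following is only an illustrative sketch of the general idea of isolating heavy keys, not the authors' algorithm. It assumes a histogram of observed key counts is available, pins the `top_k` most frequent keys to the currently least-loaded partitions, and falls back to plain hash partitioning for all other keys. The names `build_partitioner`, `partition_loads`, `stable_hash`, and `top_k` are hypothetical.

```python
import zlib
from collections import Counter


def stable_hash(key: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return zlib.crc32(key.encode("utf-8"))


def build_partitioner(key_counts, num_partitions, top_k):
    """Return a function mapping a key to a partition index.

    Hypothetical sketch: the top_k heaviest observed keys get explicit
    placements; every other key is hash-partitioned.
    """
    loads = [0] * num_partitions
    explicit = {}
    # Place the heaviest keys first, each onto the lightest partition so far.
    for key, count in Counter(key_counts).most_common(top_k):
        target = min(range(num_partitions), key=loads.__getitem__)
        explicit[key] = target
        loads[target] += count

    def partition(key):
        if key in explicit:
            return explicit[key]
        return stable_hash(key) % num_partitions

    return partition


def partition_loads(key_counts, partitioner, num_partitions):
    """Total record count each partition would receive under a partitioner."""
    loads = [0] * num_partitions
    for key, count in key_counts.items():
        loads[partitioner(key)] += count
    return loads


# Example: two dominant keys and a tail of light keys (made-up counts).
observed = {"a": 1000, "b": 900, "c": 10, "d": 5, "e": 5}
partitioner = build_partitioner(observed, num_partitions=2, top_k=2)
loads = partition_loads(observed, partitioner, num_partitions=2)
```

In a running job the histogram would have to be refreshed as the key distribution drifts, and for stateful operators any change of partitioner would also require migrating the affected keys' state, which is the hard part the abstract points to and which this sketch does not address.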

research
03/23/2021

Hybrid Edge Partitioner: Partitioning Large Power-Law Graphs under Memory Constraints

Distributed systems that manage and process graph-structured data intern...
research
02/27/2020

SWARM: Adaptive Load Balancing in Distributed Streaming Systems for Big Spatial Data

The proliferation of GPS-enabled devices has led to the development of n...
research
06/17/2020

Ranking and benchmarking framework for sampling algorithms on synthetic data streams

In the fields of big data, AI, and streaming processing, we work with la...
research
02/01/2022

Recursive Multi-Section on the Fly: Shared-Memory Streaming Algorithms for Hierarchical Graph Partitioning and Process Mapping

Partitioning a graph into balanced blocks such that few edges run betwee...
research
04/07/2020

GeoFlink: A Distributed and Scalable Framework for the Real-time Processing of Spatial Streams

Apache Flink is an open-source system for scalable processing of batch a...
research
02/12/2022

Jarvis: Large-scale Server Monitoring with Adaptive Near-data Processing

Rapid detection and mitigation of issues that impact performance and rel...
research
05/22/2017

On-the-fly Operation Batching in Dynamic Computation Graphs

Dynamic neural network toolkits such as PyTorch, DyNet, and Chainer offe...
