Efficient Time-Evolving Stream Processing at Scale

06/03/2018
by   Yu Huang, et al.

Time-evolving stream datasets are ubiquitous in real-world applications, and their inherent hot keys often evolve over time. Nevertheless, few existing solutions can balance load efficiently on these time-evolving datasets while keeping memory overhead low. In this paper, we present a novel grouping approach (named FISH) that provides efficient time-evolving stream processing at scale. The key insight of this work is that the keys of time-evolving stream data can have a skewed distribution within any bounded time interval. This makes it possible to accurately identify the recent hot keys for real-time load balancing within a bounded scope. We therefore propose an epoch-based recent hot key identification with specialized intra-epoch frequency counting (for maintaining low memory overhead) and inter-epoch hotness decaying (for suppressing superfluous computation). We also propose to heuristically infer accurate information about remote workers through computation rather than communication for cost-efficient worker assignment. We have integrated our approach into Apache Storm. Our results on a cluster of 128 nodes for both synthetic and real-world stream datasets show that FISH significantly outperforms the state of the art with the average and the 99th percentile latency reduction by 87.12 W-Choices), and memory overhead reduction by 99.96
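The epoch-based idea in the abstract can be illustrated with a small sketch. This is not FISH's algorithm, which the abstract does not spell out: the class `EpochHotKeyTracker`, its parameters, the use of a Misra-Gries summary for bounded-memory counting within an epoch, and the exponential decay across epochs are all assumptions chosen for illustration.

```python
class EpochHotKeyTracker:
    """Hypothetical sketch of epoch-based hot key tracking (not the paper's code)."""

    def __init__(self, epoch_size=1000, capacity=64, decay=0.5,
                 hot_threshold=0.05, min_hotness=0.01):
        self.epoch_size = epoch_size        # tuples per epoch (assumed)
        self.capacity = capacity            # max counters kept within an epoch
        self.decay = decay                  # weight applied to older hotness
        self.hot_threshold = hot_threshold  # hotness needed to count as hot
        self.min_hotness = min_hotness      # below this, a key is forgotten
        self.epoch_counts = {}              # intra-epoch counters
        self.hotness = {}                   # decayed cross-epoch scores
        self.seen = 0                       # tuples seen in the current epoch

    def observe(self, key):
        # Intra-epoch counting with a Misra-Gries summary: memory is bounded
        # by `capacity`, and each count is an underestimate by at most
        # epoch_size / (capacity + 1).
        counts = self.epoch_counts
        if key in counts:
            counts[key] += 1
        elif len(counts) < self.capacity:
            counts[key] = 1
        else:
            for k in list(counts):
                counts[k] -= 1
                if counts[k] == 0:
                    del counts[k]
        self.seen += 1
        if self.seen == self.epoch_size:
            self._close_epoch()

    def _close_epoch(self):
        # Inter-epoch decay: shrink old scores, then add this epoch's
        # estimated relative frequencies.
        merged = {k: v * self.decay for k, v in self.hotness.items()}
        for k, c in self.epoch_counts.items():
            merged[k] = merged.get(k, 0.0) + c / self.epoch_size
        # Forget keys whose score has decayed to a negligible value.
        self.hotness = {k: v for k, v in merged.items() if v >= self.min_hotness}
        self.epoch_counts = {}
        self.seen = 0

    def hot_keys(self):
        return {k for k, v in self.hotness.items() if v >= self.hot_threshold}
```

In this sketch a key that dominated an earlier epoch but then disappears loses half of its score at every epoch boundary, so it drops out of `hot_keys()` after a few epochs, while a newly dominant key enters as soon as its first epoch closes.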


