Benchmarking Distributed Stream Processing Engines

02/23/2018
by Jeyhun Karimov, et al.

Over the last few years, stream data processing has been gaining attention both in industry and in academia due to its wide range of applications. To fulfill the need for scalable and efficient stream analytics, numerous open-source stream data processing systems (SDPSs) have been developed, with high throughput and low latency being their key performance targets. In this paper, we propose a framework to evaluate the performance of three SDPSs, namely Apache Storm, Apache Spark, and Apache Flink. Our evaluation focuses in particular on measuring the throughput and latency of windowed operations. For this benchmark, we design workloads based on real-life, industrial use cases. The main contribution of this work is threefold. First, we give a definition of latency and throughput for stateful operators. Second, we completely separate the system under test from the driver, so that the measurement results are closer to actual system performance under real conditions. Third, we build the first driver to test the actual sustainable performance of a system under test. Our detailed evaluation highlights that there is no single winner; rather, each system excels in individual use cases.


