Chiron: Optimizing Fault Tolerance in QoS-aware Distributed Stream Processing Jobs

02/11/2021
by Morgan Geldenhuys, et al.

Fault tolerance needs deeper consideration for streaming jobs that require high availability and low-latency processing, and that must adhere to Quality-of-Service constraints even when failures occur. Typically, systems achieve fault tolerance, and the ability to recover automatically from partial failures, by implementing Checkpoint and Rollback Recovery. However, checkpointing is an expensive operation that degrades the overall performance of the system, and manually optimizing fault tolerance for specific jobs is a difficult and time-consuming task. In this paper we introduce Chiron, an approach for automatically optimizing the frequency with which checkpoints are performed in streaming jobs. For any chosen job, parallel profiling runs are performed, each with a different variant of the configuration, and the resulting metrics are used to model the impact of checkpoint-based fault tolerance on performance and availability. Understanding these relationships is key to optimizing performance objectives and meeting strict Quality-of-Service constraints. We implemented Chiron as a prototype together with Apache Flink and demonstrate its usefulness experimentally.
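To make the idea of modelling profiling metrics concrete, the following is a minimal sketch, not the authors' method. It assumes, purely for illustration, that both average latency and recovery time vary linearly with the checkpoint interval, and it uses invented profiling numbers. The names `fit_line`, `choose_interval`, and `runs` are hypothetical. The sketch picks the candidate interval with the lowest predicted latency among those whose predicted recovery time stays within a QoS bound.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, y_bar - a * x_bar


def choose_interval(runs, max_recovery_s, candidates):
    """Pick a checkpoint interval (seconds) from `candidates`.

    `runs` holds one tuple per profiling run:
    (checkpoint_interval_s, avg_latency_ms, recovery_time_s).
    Returns None when no candidate meets the recovery-time bound.
    """
    intervals = [r[0] for r in runs]
    a_lat, b_lat = fit_line(intervals, [r[1] for r in runs])
    a_rec, b_rec = fit_line(intervals, [r[2] for r in runs])

    # Keep only candidates whose predicted recovery time meets the QoS bound.
    feasible = [c for c in candidates if a_rec * c + b_rec <= max_recovery_s]
    if not feasible:
        return None
    # Among feasible candidates, minimize the predicted latency.
    return min(feasible, key=lambda c: a_lat * c + b_lat)


# Invented example data (assumption): longer intervals reduce checkpointing
# overhead (lower latency) but increase the time needed to recover.
runs = [
    (10, 120.0, 25.0),
    (30, 100.0, 45.0),
    (60, 70.0, 75.0),
    (90, 40.0, 105.0),
]

best = choose_interval(runs, max_recovery_s=60, candidates=range(10, 100, 10))
print(best)
```

In Apache Flink, a chosen interval would typically be applied through the job's checkpointing configuration; the linear model here is only a stand-in for whatever relationship the profiling runs actually reveal.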

