A Utilization Model for Optimization of Checkpoint Intervals in Distributed Stream Processing Systems

11/27/2019
by Sachini Jayasekara, et al.

State-of-the-art distributed stream processing systems such as Apache Flink and Storm have recently included checkpointing to provide fault tolerance for stateful applications. This is a necessary eventuality as these systems head into the Exascale regime, and is evidently more efficient than replication as state size grows. However, current systems use a nominal value for the checkpoint interval, one that implies roughly one failure every 19 days and takes into account neither the salient aspects of the checkpoint process nor the system scale, which can readily lead to inefficient system operation. To address this shortcoming, we provide a rigorous derivation of utilization (the fraction of total time available for the system to do useful work) that incorporates checkpoint interval, failure rate, checkpoint cost, failure detection and restart cost, depth of the system topology, and message delay. Our model yields an elegant expression for utilization and provides an optimal checkpoint interval given these parameters, interestingly showing that the optimal interval depends only on checkpoint cost and failure rate. We confirm the accuracy and efficacy of our model through experiments with Apache Flink, where we obtain improvements in system utilization in every case, especially as the system size increases. Our model provides a solid theoretical basis for the analysis and optimization of more elaborate checkpointing approaches.
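
The abstract does not reproduce the paper's utilization expression, so the sketch below does not attempt to. It uses a different, classical result instead, Young's first-order approximation T = sqrt(2C/λ), only to illustrate how an optimal interval can depend on checkpoint cost and failure rate alone. The function names, the simplified overhead formula, and the 5-second checkpoint cost are all assumptions made for illustration; topology depth, message delay, and failure detection cost are not modeled here.

```python
import math


def young_interval(checkpoint_cost_s: float, failure_rate_per_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    NOTE: this is the classical formula T = sqrt(2 * C / lambda), used here as
    a stand-in. It is not the expression derived in the paper.
    """
    if checkpoint_cost_s <= 0 or failure_rate_per_s <= 0:
        raise ValueError("checkpoint cost and failure rate must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s / failure_rate_per_s)


def approx_utilization(
    interval_s: float,
    checkpoint_cost_s: float,
    failure_rate_per_s: float,
    restart_cost_s: float = 0.0,
) -> float:
    """Simplified first-order estimate of the fraction of time doing useful work.

    Assumed overhead model (illustrative only):
      C / T                  time spent writing checkpoints
      lambda * (T / 2 + R)   expected rework plus restart cost per unit time
    """
    overhead = checkpoint_cost_s / interval_s + failure_rate_per_s * (
        interval_s / 2.0 + restart_cost_s
    )
    return max(0.0, 1.0 - overhead)


if __name__ == "__main__":
    # One failure every 19 days (the figure quoted in the abstract).
    failure_rate = 1.0 / (19 * 24 * 3600)
    # Hypothetical checkpoint cost; not a value taken from the paper.
    checkpoint_cost = 5.0

    best = young_interval(checkpoint_cost, failure_rate)
    print(f"approximate optimal interval: {best:.1f} s")
    for interval in (best / 4, best, best * 4):
        u = approx_utilization(interval, checkpoint_cost, failure_rate, restart_cost_s=30.0)
        print(f"interval {interval:9.1f} s -> estimated utilization {u:.6f}")
```

In this simplified model the restart cost shifts utilization by a constant and so does not move the optimum, which is at least consistent with the abstract's observation that the optimal interval depends only on checkpoint cost and failure rate.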

