Asymptotic efficiency of restart and checkpointing

02/21/2018
by Antonio Sodre, et al.

Many tasks are subject to failure before completion. Two of the most common failure recovery strategies are restart and checkpointing. Under restart, once a failure occurs, the task is restarted from the beginning. Under checkpointing, the task is resumed from the most recent checkpoint preceding the failure. We study the asymptotic efficiency of restart for an infinite sequence of tasks whose sizes form a stationary sequence. We define asymptotic efficiency as the limit of the ratio of the total time to completion in the absence of failures to the total time to completion when failures take place. Whether the asymptotic efficiency is positive depends on how the tail of the task-size distribution compares with the tails of the distributions of the random variables governing failures. Our framework allows for variations in the failure rates and dependencies between task sizes. We also study a similar notion of asymptotic efficiency for checkpointing when the task is infinite a.s. and the inter-checkpoint times are i.i.d. Moreover, for checkpointing with exponentially distributed failures, we prove the existence of an infinite sequence of universal checkpoints, which are always used whenever the system starts from any checkpoint that precedes them.
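To make the efficiency ratio concrete, the following is a minimal simulation sketch of the restart strategy, not the paper's construction. It assumes i.i.d. exponential times to failure, no downtime after a failure, and a finite list of task sizes standing in for the infinite sequence; the function names `restart_completion_time` and `empirical_efficiency` are hypothetical and chosen only for illustration.

```python
import random


def restart_completion_time(size, failure_rate, rng):
    """Time to finish one task of length `size` under restart.

    Assumption: each attempt draws an independent exponential time to
    failure with rate `failure_rate`; a failed attempt loses all its work.
    """
    elapsed = 0.0
    while True:
        time_to_failure = rng.expovariate(failure_rate)
        if time_to_failure >= size:
            # No failure before the task ends: this attempt succeeds.
            return elapsed + size
        # Failure mid-task: the time spent is wasted, start over.
        elapsed += time_to_failure


def empirical_efficiency(sizes, failure_rate, seed=0):
    """Ratio of failure-free total time to total time with restarts."""
    rng = random.Random(seed)
    ideal = sum(sizes)
    actual = sum(restart_completion_time(s, failure_rate, rng) for s in sizes)
    return ideal / actual


if __name__ == "__main__":
    # Constant task sizes are an illustrative choice, not the paper's setting.
    print(empirical_efficiency([1.0] * 20000, failure_rate=1.0))
```

In this special case (constant task size t and failure rate λ), the expected completion time of one task under restart is (e^{λt} - 1)/λ, so the ratio should settle near λt/(e^{λt} - 1) as the number of tasks grows. Replacing the constant sizes with a heavy-tailed sample is one way to explore how the ratio can degrade when large tasks are likely relative to the failure times.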


research
12/28/2019

A Note on the Asymptotic Optimality of Work-Conserving Disciplines in Completion Time Minimization

In this paper, we prove that under mild stochastic assumptions, work-con...
research
08/25/2022

Elly: A Real-Time Failure Recovery and Data Collection System for Robotic Manipulation

Even the most robust autonomous behaviors can fail. The goal of this res...
research
01/03/2023

On a probabilistic extension of the Oldenburger-Kolakoski sequence

The Oldenburger-Kolakoski sequence is the only infinite sequence over th...
research
05/23/2023

Failure-Sentient Composition For Swarm-Based Drone Services

We propose a novel failure-sentient framework for swarm-based drone deli...
research
11/05/2019

Failure Analysis and Quantification for Contemporary and Future Supercomputers

Large-scale computing systems today are assembled by numerous computing ...
research
04/18/2017

A Study of Deep Learning Robustness Against Computation Failures

For many types of integrated circuits, accepting larger failure rates in...
