Spot-on: A Checkpointing Framework for Fault-Tolerant Long-running Workloads on Cloud Spot Instances

10/05/2022
by Ashley Tung, et al.

Spot instances offer a cost-effective way to run applications in the cloud. However, long-running jobs are difficult to run on spot instances because the instances are subject to unpredictable evictions. Here, we present Spot-on, a generic software framework that supports fault-tolerant long-running workloads on spot instances through checkpoint and restart. Spot-on leverages existing checkpointing packages and is compatible with the major cloud vendors. Using a genomics application as a test case, we demonstrate that Spot-on supports both application-specific and transparent checkpointing methods. Compared with running the same workloads on on-demand instances, Spot-on completes them at a significantly lower computing cost. Compared with applications that use application-specific checkpoint mechanisms, transparent checkpoint-protected applications reduce runtime by up to 40 to 86
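To make the checkpoint-and-restart idea concrete, the following is a minimal sketch of application-specific checkpointing in Python. It is not Spot-on's implementation: the function names (`save_checkpoint`, `load_checkpoint`, `run_job`), the JSON state layout, and the `evict_at` parameter used to simulate an eviction are all assumptions made for illustration. The sketch only shows the general pattern of periodically persisting progress and resuming from the last saved state after an interruption.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Persist state atomically so an interruption mid-write leaves the old checkpoint intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename within the same directory


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_item": 0, "total": 0}


def run_job(items, path, interval=100, evict_at=None):
    """Sum items, checkpointing every `interval` items.

    `evict_at` is a hypothetical hook that simulates a spot eviction
    by raising an exception before processing that index.
    """
    state = load_checkpoint(path)
    for i in range(state["next_item"], len(items)):
        if evict_at is not None and i == evict_at:
            raise RuntimeError("simulated eviction")
        state["total"] += items[i]
        state["next_item"] = i + 1
        if state["next_item"] % interval == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

In this sketch, a job interrupted between checkpoints repeats only the work done since the last save. Transparent checkpointing, by contrast, snapshots the whole process from outside and needs no such changes to the application code.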


research
03/30/2020

Provisioning Spot Instances Without Employing Fault-Tolerance Mechanisms

Cloud computing offers a variable-cost payment scheme that allows cloud ...
research
05/23/2022

An Elastic Ephemeral Datastore using Cheap, Transient Cloud Resources

Spot instances are virtual machines offered at 60-90 reclaimed at any ti...
research
12/20/2021

NavP: Enabling Navigational Programming for Science Data Processing via Application-Initiated Checkpointing

Science Data Systems (SDS) handle science data from acquisition through ...
research
07/01/2021

Scrooge Attack: Undervolting ARM Processors for Profit

Latest ARM processors are approaching the computational power of x86 arc...
research
03/15/2021

Improving scalability and reliability of MPI-agnostic transparent checkpointing for production workloads at NERSC

Checkpoint/restart (C/R) provides fault-tolerant computing capability, e...
research
09/03/2019

An Event-Driven Approach to Serverless Seismic Imaging in the Cloud

Adapting the cloud for high-performance computing (HPC) is a challenging...
research
07/27/2018

NDBench: Benchmarking Microservices at Scale

Software vendors often report performance numbers for the sweet spot or ...
