Reliability and Fault-Tolerance by Choreographic Design

08/24/2017
by Ian Cassar, et al.

Distributed programs are hard to get right because they are required to be open, scalable, long-running, and tolerant of faults. Recent approaches to distributed software based on (micro-)services, where different services are developed independently by disparate teams, exacerbate the problem: services are meant to be composed together and run in open contexts where unpredictable behaviours can emerge. This makes it necessary to adopt suitable strategies for monitoring the execution and to incorporate recovery and adaptation mechanisms so as to make distributed programs more flexible and robust. The typical approach currently adopted is to embed such mechanisms in the program logic, which makes them hard to extract, compare, and debug. We propose an approach that employs formal abstractions for specifying failure recovery and adaptation strategies. Although implementation-agnostic, these abstractions would be amenable to algorithmic synthesis of code, monitoring, and tests. We consider message-passing programs (à la Erlang, Go, or MPI), which are gaining momentum in both academia and industry. Our research agenda consists of (1) the definition of formal behavioural models encompassing failures, (2) the specification of the relevant properties of adaptation and recovery strategies, and (3) the automatic generation of monitoring, recovery, and adaptation logic in target languages of interest.

research
09/09/2023

From Reversible Computation to Checkpoint-Based Rollback Recovery for Message-Passing Concurrent Programs

The reliability of concurrent and distributed systems often depends on s...
research
02/13/2021

MATCH: An MPI Fault Tolerance Benchmark Suite

MPI has been ubiquitously deployed in flagship HPC systems aiming to acc...
research
04/12/2018

A high-level C++ approach to manage local errors, asynchrony and faults in an MPI application

C++ advocates exceptions as the preferred way to handle unexpected behav...
research
09/28/2022

Towards Auditable Distributed Systems

The emerging trend towards distributed (cloud) systems (DS) has widely a...
research
10/06/2022

A Distributed System-level Diagnosis Model for the Implementation of Unreliable Failure Detectors

Reliable systems require effective monitoring techniques for fault ident...
research
07/16/2020

Soft Errors Detection and Automatic Recovery based on Replication combined with different Levels of Checkpointing

Handling faults is a growing concern in HPC. In future exascale systems,...
