Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication

10/01/2018
by Gabriele D'Angelo, et al.

This paper presents FT-GAIA, a software-based fault-tolerant parallel and distributed simulation middleware. FT-GAIA has been designed to reliably handle Parallel And Distributed Simulation (PADS) models, which are needed to properly simulate and analyze complex systems arising in any kind of scientific or engineering field. PADS takes advantage of multiple execution units running on multicore processors, clusters of workstations, or HPC systems. However, large computing systems, such as HPC systems that include hundreds of thousands of computing nodes, have to handle frequent failures of some components. To cope with this issue, FT-GAIA transparently replicates simulation entities and distributes the replicas across multiple execution nodes. This allows the simulation to tolerate crash failures of computing nodes. Moreover, FT-GAIA offers some protection against Byzantine failures: interaction messages among the simulated entities are replicated as well, so the receiving entity can identify and discard corrupted messages. Results from an analytical model and from an experimental evaluation show that FT-GAIA provides a high degree of fault tolerance, at the cost of a moderate increase in the computational load of the execution units.
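The replication idea described in the abstract can be illustrated with a minimal Python sketch. It is not FT-GAIA's implementation or API: the function names `place_replicas` and `select_message`, the round-robin placement, and the strict-majority voting rule are all assumptions made for illustration, since the abstract does not specify how replicas are placed or how corrupted messages are detected.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence


def place_replicas(entity_id: int, replicas: int, nodes: int) -> List[int]:
    """Assign each replica of an entity to a distinct node (hypothetical round-robin policy)."""
    if replicas > nodes:
        raise ValueError("need at least as many nodes as replicas")
    # Consecutive offsets modulo `nodes` are distinct when replicas <= nodes,
    # so a single node crash removes at most one replica of the entity.
    return [(entity_id + k) % nodes for k in range(replicas)]


def select_message(copies: Sequence[Hashable], replicas: int) -> Optional[Hashable]:
    """Return the payload sent by a strict majority of the sender's replicas, else None.

    `copies` holds the payloads actually received; some may be missing
    (crashed replicas) or altered (faulty replicas). The majority rule is
    an assumption of this sketch.
    """
    if not copies:
        return None
    payload, votes = Counter(copies).most_common(1)[0]
    return payload if votes > replicas // 2 else None


if __name__ == "__main__":
    print(place_replicas(entity_id=5, replicas=3, nodes=4))
    # Two copies agree and one is corrupted: the corrupted copy is discarded.
    print(select_message([b"move:3", b"move:3", b"move:9"], replicas=3))
```

Under these assumptions, a receiver with three sender replicas still obtains the correct payload when one copy is missing or altered, and rejects the message when no payload reaches a majority.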

research
08/05/2019

The fault-tolerant cluster-sending problem

The development of fault-tolerant distributed systems that can tolerate ...
research
10/19/2022

Fries: Fast and Consistent Runtime Reconfiguration in Dataflow Systems with Transactional Guarantees (Extended Version)

A computing job in a big data system can take a long time to run, especi...
research
06/18/2019

SeeMoRe: A Fault-Tolerant Protocol for Hybrid Cloud Environments

Large scale data management systems utilize State Machine Replication to...
research
03/16/2021

Distributed Deep Learning Using Volunteer Computing-Like Paradigm

Use of Deep Learning (DL) in commercial applications such as image class...
research
02/04/2020

Providing Insights for Queries affected by Failures and Stragglers

Interactive time responses are a crucial requirement for users analyzing...
research
08/07/2023

The FIDS Theorems: Tensions between Multinode and Multicore Performance in Transactional Systems

Traditionally, distributed and parallel transactional systems have been ...
research
10/05/2016

The Simulation Model Partitioning Problem: an Adaptive Solution Based on Self-Clustering (Extended Version)

This paper is about partitioning in parallel and distributed simulation....
