Self-healing Dilemmas in Distributed Systems: Fault-correction vs. Fault-tolerance

07/10/2020
by Jovan Nikolic, et al.

Large-scale decentralized systems of autonomous agents interacting via asynchronous communication often experience the following self-healing dilemma: fault detection inherits network uncertainties, making a faulty process indistinguishable from a slow process. The implications can be dramatic: self-healing mechanisms become biased and cost-ineffective. In particular, triggering an undesirable fault-correction results in new faults that could be prevented with fault-tolerance instead. Nevertheless, fault-tolerance alone, without eventually correcting persistent faults, leaves systems underperforming as well. Measuring, understanding and resolving such self-healing dilemmas is a timely challenge and a critical requirement given the rise of distributed ledgers, edge computing and the Internet of Things in application domains such as energy, transport and health. This paper introduces a novel and general-purpose modeling of fault scenarios. These scenarios can accurately measure and predict inconsistencies generated by fault-correction and fault-tolerance when each node in a network can monitor the health status of another node, while both can defect. In contrast to related work, no information about the computational/application scenario, overlying algorithms or application data is required. A rigorous experimental methodology is designed that evaluates 696 experimental settings of different fault scales, fault profiles and fault detection thresholds, each with almost 9M measurements of inconsistencies in a prototyped decentralized network of 3000 nodes. The prediction performance of the modeled fault scenarios is validated in a challenging application scenario of decentralized and dynamic in-network aggregation using real-world data from a Smart Grid pilot project. Findings confirm the origin of inconsistencies at the design phase and provide new insights into how to tune self-healing mechanisms at the design phase.
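The dilemma described in the abstract, that a slow process cannot be told apart from a faulty one, can be illustrated with a minimal sketch of timeout-based fault detection. This is not the paper's model: the function name `evaluate_threshold`, the reply delays, and the convention that a crashed node is represented by `None` are all assumptions made for illustration only.

```python
from typing import Dict, List, Optional


def evaluate_threshold(
    reply_delays: List[Optional[float]], threshold: float
) -> Dict[str, float]:
    """Count the outcomes of a timeout-based detector (hypothetical sketch).

    reply_delays: one entry per monitored node; a float is the reply delay
        in seconds of a healthy (possibly slow) node, None means the node
        crashed and never replies.
    threshold: how long the monitor waits before suspecting a fault.
    """
    # Healthy but slow nodes that exceed the threshold are wrongly suspected;
    # correcting them is the "undesirable fault-correction" case.
    false_suspicions = sum(
        1 for d in reply_delays if d is not None and d > threshold
    )
    # Crashed nodes never reply, so they are always suspected eventually,
    # but only after the full threshold has elapsed.
    true_detections = sum(1 for d in reply_delays if d is None)
    return {
        "false_suspicions": false_suspicions,
        "true_detections": true_detections,
        "detection_latency": threshold,
    }


# Illustrative (made-up) observations: four healthy nodes, two crashed ones.
delays = [0.1, 0.4, 1.2, None, 2.5, None]

aggressive = evaluate_threshold(delays, threshold=0.5)
tolerant = evaluate_threshold(delays, threshold=3.0)
print(aggressive)
print(tolerant)
```

With these assumed delays, the low threshold suspects two healthy nodes, while the high threshold avoids false suspicions at the price of waiting longer before a persistent fault is acted upon. This is the trade-off between fault-correction and fault-tolerance that the fault detection thresholds in the experiments vary.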

research
03/24/2022

Evaluation of IoT Self-healing Mechanisms using Fault-Injection in Message Brokers

The widespread use of Internet-of-Things (IoT) across different applicat...
research
05/07/2018

Holarchic Structures for Decentralized Deep Learning - A Performance Analysis

Structure plays a key role in learning performance. In centralized compu...
research
09/17/2023

Predictive Fault Tolerance for Autonomous Robot Swarms

Active fault tolerance is essential for robot swarms to retain long-term...
research
12/02/2022

DeepFT: Fault-Tolerant Edge Computing using a Self-Supervised Deep Surrogate Model

The emergence of latency-critical AI applications has been supported by ...
research
12/04/2021

PreGAN: Preemptive Migration Prediction Network for Proactive Fault-Tolerant Edge Computing

Building a fault-tolerant edge system that can quickly react to node ove...
research
07/03/2023

Internet of Things Fault Detection and Classification via Multitask Learning

This paper presents a comprehensive investigation into developing a faul...
