A Distributed System-level Diagnosis Model for the Implementation of Unreliable Failure Detectors

10/06/2022
by Elias P. Duarte Jr., et al.

Reliable systems require effective monitoring techniques for fault identification. System-level diagnosis was originally proposed in the 1960s as a test-based approach to monitor and identify faulty components of a general system. Over the past decades, several diagnosis models and strategies have been proposed, based on different fault models, and applied to a wide variety of computer systems. In the 1990s, unreliable failure detectors emerged as an abstraction to enable consensus in asynchronous systems subject to crash faults. Since then, failure detectors have become the de facto standard for monitoring distributed systems. The purpose of the present work is to fill a conceptual gap by presenting a distributed diagnosis model that is consistent with unreliable failure detectors. Results are presented for the number of tests/monitoring messages required, the latency of event detection, and completeness and accuracy. Three different failure detectors compliant with the proposed model are presented, including vRing and vCube, which provide scalable alternatives to the traditional all-monitor-all strategy adopted by most existing failure detectors.
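To make the scalability contrast concrete, the sketch below counts monitoring tests per round under an all-monitor-all strategy and under a generic ring in which each process monitors only the next process it believes to be correct. This is an illustration under stated assumptions, not the paper's vRing or vCube: the abstract does not give their details, and the function names (`all_monitor_all_tests`, `ring_tests`, `next_correct`) are hypothetical.

```python
from typing import Optional, Set


def all_monitor_all_tests(n: int) -> int:
    """Tests per round when each of n processes monitors every other process."""
    return n * (n - 1)


def ring_tests(n: int) -> int:
    """Tests per round on a fault-free ring where each process monitors
    only its successor (assumed simplification, not the paper's vRing)."""
    return n if n > 1 else 0


def next_correct(i: int, n: int, crashed: Set[int]) -> Optional[int]:
    """Return the first process after i on the ring that is not in `crashed`,
    or None if every other process has crashed."""
    for k in range(1, n):
        j = (i + k) % n
        if j not in crashed:
            return j
    return None


if __name__ == "__main__":
    for n in (4, 8, 64):
        print(n, all_monitor_all_tests(n), ring_tests(n))
```

The quadratic growth of the first count against the linear growth of the second is the kind of gap that motivates alternatives to all-monitor-all; the trade-off, which this sketch does not model, is how quickly information about an event reaches every process.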


research
03/24/2022

A platform for causal knowledge representation and inference in industrial fault diagnosis based on cubic DUCG

The working conditions of large-scale industrial systems are very comple...
research
07/17/2011

A Temporal Neuro-Fuzzy Monitoring System to Manufacturing Systems

Fault diagnosis and failure prognosis are essential techniques in improv...
research
11/12/2018

You Only Live Multiple Times: A Blackbox Solution for Reusing Crash-Stop Algorithms In Realistic Crash-Recovery Settings

Distributed agreement-based algorithms are often specified in a crash-st...
research
05/22/2022

Monitoring of Perception Systems: Deterministic, Probabilistic, and Learning-based Fault Detection and Identification

This paper investigates runtime monitoring of perception systems. Percep...
research
04/09/2020

DCO Analyzer: Local Controllability and Observability Analysis and Enforcement of Distributed Test Scenarios

To ensure interoperability and the correct behavior of heterogeneous dis...
research
05/04/2022

Angular Control Charts: A New Perspective for Monitoring Reliability of Multi-State Systems

Control charts, as had been used traditionally for quality monitoring, w...
research
08/24/2017

Reliability and Fault-Tolerance by Choreographic Design

Distributed programs are hard to get right because they are required to ...
