Predicting Failures in Multi-Tier Distributed Systems

11/21/2019
by Leonardo Mariani, et al.

Many applications are implemented as multi-tier software systems and run on distributed infrastructures, such as cloud infrastructures, to benefit from the cost reduction that comes from dynamically allocating resources on demand. In these systems, failures are becoming the norm rather than the exception. Predicting their occurrence and locating the responsible faults are essential enablers of preventive and corrective actions that can mitigate the impact of failures and significantly improve the dependability of the systems. Current failure prediction approaches suffer from either false positives or limited accuracy, and do not produce enough information to effectively locate the responsible faults. In this paper, we present PreMiSE, a lightweight and precise approach to predict failures and locate the corresponding faults in multi-tier distributed systems. PreMiSE blends anomaly-based and signature-based techniques to identify multi-tier failures that affect performance indicators, with high precision and a low false positive rate. The experimental results that we obtained on a Cloud-based IP Multimedia Subsystem indicate that PreMiSE can indeed predict and locate possible failure occurrences with high precision and low overhead.

research
02/02/2023

A novel failure indexing approach with run-time values of program variables

Failures with different root causes can disturb multi-fault localization...
research
08/25/2022

PREVENT: An Unsupervised Approach to Predict Software Failures in Production

This paper presents PREVENT, an approach for predicting and localizing f...
research
10/06/2021

Cloud Failure Prediction with Hierarchical Temporal Memory: An Empirical Assessment

Hierarchical Temporal Memory (HTM) is an unsupervised learning algorithm...
research
05/23/2023

Mitigating the Performance Impact of Network Failures in Public Clouds

Some faults in data center networks require hours to days to repair beca...
research
03/01/2018

Localizing Faults in Cloud Systems

By leveraging large clusters of commodity hardware, the Cloud offers gre...
research
12/15/2022

Calculation of the High-Energy Neutron Flux for Anticipating Errors and Recovery Techniques in Exascale Supercomputer Centres

The age of exascale computing has arrived and the risks associated with ...
research
08/30/2019

Enhancing Failure Propagation Analysis in Cloud Computing Systems

In order to plan for failure recovery, the designers of cloud systems ne...
