Enhancing Failure Propagation Analysis in Cloud Computing Systems

by Domenico Cotroneo, et al.
University of Naples Federico II

To plan for failure recovery, the designers of cloud systems need to understand how their systems can fail. Unfortunately, analyzing the failure behavior of such systems can be very difficult and time-consuming because of the large volume of events, non-determinism, and the reuse of third-party components. To address these issues, we propose a novel approach that combines fault injection with anomaly detection to identify the symptoms of failures. We evaluated the proposed approach in the context of the OpenStack cloud computing platform. We show that our model can significantly improve the accuracy of failure analysis in terms of false positives and false negatives, with a low computational cost.
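
The general idea of pairing fault injection with anomaly detection can be illustrated with a toy sketch. This is not the model evaluated in the paper: it assumes each run is recorded as a plain list of event names, builds a per-event baseline from fault-free runs, and flags events in a fault-injected run whose counts deviate strongly from that baseline. The function names (`event_baseline`, `anomalous_events`), the z-score rule, and the `threshold` and `min_std` parameters are all hypothetical choices made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def event_baseline(fault_free_runs):
    """Return {event: (mean count, std of count)} over fault-free runs.

    Assumption: each run is a list of event-name strings.
    """
    counts = [Counter(run) for run in fault_free_runs]
    events = set().union(*counts)
    baseline = {}
    for event in events:
        per_run = [c[event] for c in counts]  # a missing event counts as 0
        baseline[event] = (mean(per_run), pstdev(per_run))
    return baseline


def anomalous_events(baseline, faulty_run, threshold=3.0, min_std=0.5):
    """Flag events whose count in a fault-injected run deviates from the baseline.

    The score is a z-score with a floor on the standard deviation (min_std),
    so events that never vary in fault-free runs do not cause division by zero.
    Both parameters are illustrative defaults, not values from the paper.
    """
    observed = Counter(faulty_run)
    anomalies = {}
    for event in set(baseline) | set(observed):
        mu, sigma = baseline.get(event, (0.0, 0.0))
        score = abs(observed[event] - mu) / max(sigma, min_std)
        if score > threshold:
            anomalies[event] = score
    return anomalies


if __name__ == "__main__":
    # Hypothetical event traces, not data from the paper.
    fault_free = [
        ["create", "attach", "ok"],
        ["create", "attach", "ok"],
        ["create", "attach", "ok", "ok"],
    ]
    faulty = ["create", "error", "error", "error", "timeout"]
    print(anomalous_events(event_baseline(fault_free), faulty))
```

Comparing against several fault-free runs rather than a single one is what lets a sketch like this tolerate some non-determinism: events whose counts naturally vary between runs get a larger standard deviation and are less likely to be reported as failure symptoms.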


Fault Injection Analytics: A Novel Approach to Discover Failure Modes in Cloud-Computing Systems

Cloud computing systems fail in complex and unexpected ways due to unexp...

Enhancing the Analysis of Software Failures in Cloud Computing Systems with Deep Learning

Identifying the failure modes of cloud computing systems is a difficult ...

How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform

Cloud management systems provide abstractions and APIs for programmatica...

Anomaly detecting and ranking of the cloud computing platform by multi-view learning

Anomaly detecting as an important technical in cloud computing is applie...

Run-time Failure Detection via Non-intrusive Event Analysis in a Large-Scale Cloud Computing Platform

Cloud computing systems fail in complex and unforeseen ways due to unexp...

Online Self-Evolving Anomaly Detection in Cloud Computing Environments

Modern cloud computing systems contain hundreds to thousands of computin...

MLOps with enhanced performance control and observability

The explosion of data and its ever increasing complexity in the last few...