Run-time Failure Detection via Non-intrusive Event Analysis in a Large-Scale Cloud Computing Platform

01/18/2023
by   Domenico Cotroneo, et al.

Cloud computing systems fail in complex and unforeseen ways due to unexpected combinations of events and interactions among hardware and software components. These failures are especially problematic when they are silent, i.e., not accompanied by any explicit failure notification, which hinders timely detection and recovery. In this work, we propose an approach to run-time failure detection tailored to monitoring multi-tenant, concurrent cloud computing systems. The approach uses a non-intrusive form of event tracing, which requires no manual changes to the system's internals to propagate session identifiers (IDs), and builds a set of lightweight monitoring rules from fault-free executions. We evaluated the effectiveness of the approach in detecting failures in the OpenStack cloud computing platform, a complex, "off-the-shelf" distributed system, by executing a campaign of fault injection experiments in a multi-tenant scenario. Our experiments show that the approach detects failures with an F1 score (0.85) and accuracy (0.77) higher than those of the OpenStack failure logging mechanisms (0.53 and 0.50) and of two non-session-aware run-time verification approaches (both lower than 0.15). Moreover, the approach significantly decreases the average time to detect failures at run-time (114 seconds) compared to the OpenStack logging mechanisms.
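
For readers unfamiliar with the two metrics reported above, the following minimal sketch shows how an F1 score and an accuracy are conventionally computed from the outcome counts of a detector. It uses the standard textbook definitions and is not taken from the paper; the `ConfusionCounts` type and the counts in the example are hypothetical and chosen only for illustration, not results from the experiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts for a failure detector (hypothetical helper type)."""
    tp: int  # failed experiments the detector flagged
    fp: int  # non-failed experiments the detector flagged
    tn: int  # non-failed experiments the detector left unflagged
    fn: int  # failed experiments the detector missed


def precision(c: ConfusionCounts) -> float:
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: ConfusionCounts) -> float:
    failures = c.tp + c.fn
    return c.tp / failures if failures else 0.0


def f1_score(c: ConfusionCounts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all experiments that were classified correctly."""
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


# Illustrative counts only; these are NOT the paper's data.
example = ConfusionCounts(tp=8, fp=2, tn=6, fn=4)
print(f"F1 = {f1_score(example):.2f}, accuracy = {accuracy(example):.2f}")
```

Because F1 ignores true negatives while accuracy counts them, the two numbers can diverge, which is why the abstract reports both.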


