How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform

07/09/2019
by Domenico Cotroneo, et al.

Cloud management systems provide abstractions and APIs for programmatically configuring cloud infrastructures. Unfortunately, residual software bugs in these systems can lead to high-severity failures, such as prolonged outages and data losses. In this paper, we investigate the impact of failures in the context of the widespread OpenStack cloud management system, by performing fault injection and by analyzing the resulting failures in terms of fail-stop behavior, failure detection through logging, and failure propagation across components. The analysis points out that most of the failures are not detected and reported in a timely manner; moreover, many of these failures can silently propagate over time and through components of the cloud management system, which calls for more thorough run-time checks and fault containment.

research
10/13/2020

Towards Runtime Verification via Event Stream Processing in Cloud Computing Infrastructures

Software bugs in cloud management systems often cause erratic behavior, ...
research
01/18/2023

Run-time Failure Detection via Non-intrusive Event Analysis in a Large-Scale Cloud Computing Platform

Cloud computing systems fail in complex and unforeseen ways due to unexp...
research
11/13/2013

Impact of Limpware on HDFS: A Probabilistic Estimation

With the advent of cloud computing, thousands of machines are connected ...
research
08/30/2019

Enhancing Failure Propagation Analysis in Cloud Computing Systems

In order to plan for failure recovery, the designers of cloud systems ne...
research
10/23/2021

Characterizing User and Provider Reported Cloud Failures

Cloud computing is the backbone of the digital society. Digital banking,...
research
06/12/2021

Intelligent Vision Based Wear Forecasting on Surfaces of Machine Tool Elements

This paper addresses the ability to enable machines to automatically det...
research
05/12/2019

Automating chaos experiments in production

Distributed systems often face transient errors and localized component ...
