Automating chaos experiments in production

05/12/2019
by   Ali Basiri, et al.
0

Distributed systems often face transient errors and localized component degradation and failure. Verifying that the overall system remains healthy in the face of such failures is challenging. At Netflix, we have built a platform for automatically generating and executing chaos experiments, which check how well the production system can handle component failures and slowdowns. This paper describes the platform and our experiences operating it.

READ FULL TEXT

page 2

page 6

page 9

research
10/13/2020

Towards Runtime Verification via Event Stream Processing in Cloud Computing Infrastructures

Software bugs in cloud management systems often cause erratic behavior, ...
research
07/09/2019

How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform

Cloud management systems provide abstractions and APIs for programmatica...
research
01/18/2023

Run-time Failure Detection via Non-intrusive Event Analysis in a Large-Scale Cloud Computing Platform

Cloud computing systems fail in complex and unforeseen ways due to unexp...
research
08/25/2022

PREVENT: An Unsupervised Approach to Predict Software Failures in Production

This paper presents PREVENT, an approach for predicting and localizing f...
research
08/17/2022

When malloc() Never Returns NULL – Reliability as an Illusion

For decades, the guidance given to software engineers has been to check ...
research
06/12/2021

Intelligent Vision Based Wear Forecasting on Surfaces of Machine Tool Elements

This paper addresses the ability to enable machines to automatically det...

Please sign up or login with your details

Forgot password? Click here to reset