Automating chaos experiments in production

05/12/2019
by Ali Basiri, et al.
Netflix

Distributed systems often face transient errors and localized component degradation and failure. Verifying that the overall system remains healthy in the face of such failures is challenging. At Netflix, we have built a platform for automatically generating and executing chaos experiments, which check how well the production system can handle component failures and slowdowns. This paper describes the platform and our experiences operating it.

10/13/2020

Towards Runtime Verification via Event Stream Processing in Cloud Computing Infrastructures

Software bugs in cloud management systems often cause erratic behavior, ...
07/09/2019

How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform

Cloud management systems provide abstractions and APIs for programmatica...
01/18/2023

Run-time Failure Detection via Non-intrusive Event Analysis in a Large-Scale Cloud Computing Platform

Cloud computing systems fail in complex and unforeseen ways due to unexp...
08/25/2022

PREVENT: An Unsupervised Approach to Predict Software Failures in Production

This paper presents PREVENT, an approach for predicting and localizing f...
08/17/2022

When malloc() Never Returns NULL – Reliability as an Illusion

For decades, the guidance given to software engineers has been to check ...
06/12/2021

Intelligent Vision Based Wear Forecasting on Surfaces of Machine Tool Elements

This paper addresses the ability to enable machines to automatically det...