Co-evolving Tracing and Fault Injection with Box of Pain

03/28/2019
by   Daniel Bittman, et al.

Distributed systems are hard to reason about, largely because of uncertainty about what may go wrong in a particular execution and about whether the system will mitigate those faults. Tools that perturb executions can help test whether a system is robust to faults, while tools that observe executions can help developers better understand the system-wide effects of those faults. We present Box of Pain, a tracer and fault injector for unmodified distributed systems that addresses both concerns by interposing at the system call level and dynamically reconstructing the partial order of communication events based on causal relationships. Box of Pain's lightweight approach to tracing, and its focus on simulating the effects of partial failures on communication rather than the failures themselves, set it apart from other tracing and fault injection systems. We present evidence of the promise of Box of Pain and its approach to lightweight observation and perturbation of distributed systems.
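
To make the idea of reconstructing a partial order of communication events more concrete, here is a minimal sketch in Python. It is not Box of Pain's implementation: the abstract does not say how the order is computed, so the use of vector clocks, the trace format (`traces`, with `"send"`, `"recv"`, and `"local"` events tagged by a message id), and the function names `vector_clocks` and `happens_before` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical trace format: each event is (kind, message_id),
# where kind is "send", "recv", or "local".
Event = Tuple[str, str]
EventKey = Tuple[str, int]  # (process name, index in that process's trace)


def vector_clocks(traces: Dict[str, List[Event]]) -> Dict[EventKey, Tuple[int, ...]]:
    """Assign a vector clock to every event that can be causally ordered."""
    procs = sorted(traces)
    slot = {p: i for i, p in enumerate(procs)}
    current = {p: [0] * len(procs) for p in procs}  # running clock per process
    cursor = {p: 0 for p in procs}                  # next unprocessed event
    sent: Dict[str, List[int]] = {}                 # message id -> clock at send
    clocks: Dict[EventKey, Tuple[int, ...]] = {}

    progressed = True
    while progressed:
        progressed = False
        for p in procs:
            while cursor[p] < len(traces[p]):
                kind, msg = traces[p][cursor[p]]
                if kind == "recv":
                    if msg not in sent:
                        break  # matching send not processed yet; try other processes
                    # A receive inherits everything the sender knew at send time.
                    current[p] = [max(a, b) for a, b in zip(current[p], sent[msg])]
                current[p][slot[p]] += 1
                if kind == "send":
                    sent[msg] = list(current[p])
                clocks[(p, cursor[p])] = tuple(current[p])
                cursor[p] += 1
                progressed = True
    return clocks


def happens_before(a: Tuple[int, ...], b: Tuple[int, ...]) -> bool:
    """True if the event with clock a causally precedes the event with clock b."""
    return a != b and all(x <= y for x, y in zip(a, b))


if __name__ == "__main__":
    traces = {
        "client": [("send", "m1"), ("recv", "m2")],
        "server": [("recv", "m1"), ("local", ""), ("send", "m2")],
    }
    for key, clock in sorted(vector_clocks(traces).items()):
        print(key, clock)
```

Two events are ordered only when one clock is component-wise less than or equal to the other; events with incomparable clocks are concurrent, which is what makes the result a partial order rather than a total one. In this sketch, a receive whose matching send never appears is simply left without a clock.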


research
01/23/2021

Resilient Virtualized Systems Using ReHype

System-level virtualization introduces critical vulnerabilities to failu...
research
10/07/2021

FaaSter Troubleshooting – Evaluating Distributed Tracing Approaches for Serverless Applications

Serverless applications can be particularly difficult to troubleshoot, a...
research
07/01/2019

Understanding Fault Scenarios and Impacts through Fault Injection Experiments in Cielo

We present a set of fault injection experiments performed on the ACES (L...
research
07/27/2020

A Machine Learning Approach to Online Fault Classification in HPC Systems

As High-Performance Computing (HPC) systems strive towards the exascale ...
research
09/19/2022

Distributed Execution Indexing

This work-in-progress report presents both the design and partial evalua...
research
05/04/2023

Distributed System Fuzzing

Grey-box fuzzing is the lightweight approach of choice for finding bugs ...
research
04/26/2021

Revisiting the size effect in software fault prediction models

BACKGROUND: In object oriented (OO) software systems, class size has bee...
