Understanding Fault Scenarios and Impacts through Fault Injection Experiments in Cielo

07/01/2019
by Valerio Formicola, et al.

We present a set of fault injection experiments performed on Cielo, the ACES (LANL/SNL) Cray XE supercomputer. We use this experimental campaign to improve our understanding of the failure causes and propagation observed in the field failure data analysis of NCSA's Blue Waters. We use data collected from the logs and from network performance counters 1) to characterize the fault-error-failure sequence and recovery mechanisms in the Gemini network and in the Cray compute nodes, 2) to understand the impact of failures on the system and on user applications at different scales, and 3) to identify and recreate fault scenarios that induce unrecoverable failures, in order to create new tests for system and application design. The faults were injected through special input commands that bring down network links, directional connections, nodes, and blades. We also present the extensions needed to apply our injection and analysis methodologies to Cray XC (Aries) systems.
