Resilient Virtualized Systems Using ReHype

01/23/2021
by Michael Le, et al.

System-level virtualization introduces critical vulnerabilities to failures of the software components that implement virtualization – the virtualization infrastructure (VI). To mitigate the impact of such failures, we introduce a resilient VI (RVI) that can recover individual VI components from failures caused by hardware or software faults, transparently to the hosted virtual machines (VMs). Much of the focus is on the ReHype mechanism for recovery from hypervisor failures, which can lead to state corruption and to inconsistencies among the states of system components. ReHype's implementation for the Xen hypervisor was done incrementally, using fault injection results to identify sources of critical corruption and inconsistencies. This implementation involved 900 LOC, with a memory space overhead of 2.1MB. Fault injection campaigns, with a variety of fault types, show that ReHype can successfully recover, in less than 750ms, from over 88% of detected hypervisor failures. In addition to ReHype, recovery mechanisms for the other VI components are described. The overall effectiveness of our RVI is evaluated by hosting a Web service application on a cluster of VMs. With faults in any VI component, for over 87% of detected failures, our recovery mechanisms allow the services provided by the application to be continuously maintained despite the resulting failures of VI components.


