MODC: Resilience for disaggregated memory architectures using task-based programming

09/11/2021
by Kimberly Keeton, et al.

Disaggregated memory architectures provide benefits to applications beyond those of traditional scale-out environments, such as independent scaling of compute and memory resources. They also provide an independent failure model, in which computations, or the compute nodes they run on, may fail independently of the disaggregated memory; data resident in the disaggregated memory is therefore unaffected by a compute failure. Blind application of traditional resilience techniques (e.g., checkpoints or data replication) does not take advantage of these architectures. To demonstrate the potential benefit of these architectures for resilience, we develop Memory-Oriented Distributed Computing (MODC), a framework for programming disaggregated architectures that borrows and adapts ideas from task-based programming models, concurrent programming techniques, and lock-free data structures. This framework includes a task-based application programming model and a runtime system that provides scheduling, coordination, and fault tolerance mechanisms. We present highlights of our MODC prototype and experimental results demonstrating that MODC-style resilience outperforms a checkpoint-based approach in the face of failures.
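
The independent failure model described in the abstract can be illustrated with a small simulation. The sketch below is not MODC's API, which the abstract does not describe; every name in it (`SharedMemory`, `run_worker`, `recover`, `WorkerCrash`) is hypothetical. It assumes that task state and results live in a memory pool that outlives any worker, that tasks are idempotent, and that workers claim tasks with an atomic compare-and-swap (emulated here with a lock). Under those assumptions, a worker failure loses only the task that was in flight, and a surviving worker re-runs just that task instead of rolling back to a checkpoint.

```python
import threading


class SharedMemory:
    """Stand-in for disaggregated memory: it outlives any single worker."""

    def __init__(self):
        self._cells = {}
        self._lock = threading.Lock()  # emulates hardware atomic operations

    def load(self, key):
        with self._lock:
            return self._cells.get(key)

    def store(self, key, value):
        with self._lock:
            self._cells[key] = value

    def compare_and_swap(self, key, expected, new):
        """Atomically set key to new only if it currently equals expected."""
        with self._lock:
            if self._cells.get(key) == expected:
                self._cells[key] = new
                return True
            return False


class WorkerCrash(Exception):
    """Simulated compute-node failure (hypothetical, for illustration)."""


def run_worker(mem, worker_id, task_ids, task_fn, crash_on=None):
    """Claim and run pending tasks; optionally crash while holding one."""
    for tid in task_ids:
        state_key = ("state", tid)
        # Only one worker can move a task from PENDING to RUNNING.
        if not mem.compare_and_swap(state_key, "PENDING", ("RUNNING", worker_id)):
            continue
        if tid == crash_on:
            raise WorkerCrash(worker_id)  # fails before publishing a result
        mem.store(("result", tid), task_fn(tid))
        mem.store(state_key, "DONE")


def recover(mem, failed_worker, task_ids):
    """Return tasks held by the failed worker to PENDING; DONE tasks are untouched."""
    for tid in task_ids:
        mem.compare_and_swap(("state", tid), ("RUNNING", failed_worker), "PENDING")


def demo():
    mem = SharedMemory()
    task_ids = list(range(6))
    for tid in task_ids:
        mem.store(("state", tid), "PENDING")

    def square(x):
        return x * x

    try:
        run_worker(mem, "w0", task_ids, square, crash_on=3)
    except WorkerCrash:
        recover(mem, "w0", task_ids)

    # A second worker finishes the remaining work; completed tasks are skipped.
    run_worker(mem, "w1", task_ids, square)
    return [mem.load(("result", tid)) for tid in task_ids]
```

In this sketch, calling `demo()` has worker `w0` complete tasks 0 through 2 and fail while holding task 3; worker `w1` then re-executes only task 3 and runs tasks 4 and 5, because the results of the earlier tasks are still in the shared pool.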

research
10/18/2020

Fault Tolerance for Remote Memory Access Programming Models

Remote Memory Access (RMA) is an emerging mechanism for programming high...
research
10/19/2020

Towards Distributed Software Resilience in Asynchronous Many-Task Programming Models

Exceptions and errors occurring within mission critical applications due...
research
05/15/2023

Blizzard: Adding True Persistence to Main Memory Data Structures

Persistent memory (PMEM) devices present an opportunity to retain the fl...
research
11/05/2019

Soft Error Resilience and Failure Recovery for Continuum Dynamics Applications

The persistently growing resilience concerns of large-scale computing sy...
research
10/21/2021

Model-based Reinforcement Learning for Service Mesh Fault Resiliency in a Web Application-level

Microservice-based architectures enable different aspects of web applica...
research
07/30/2019

How to Make the Preconditioned Conjugate Gradient Method Resilient Against Multiple Node Failures

We study algorithmic approaches for recovering from the failure of sever...
research
05/16/2018

NFVactor: A Resilient NFV System using the Distributed Actor Model

Resilience functionality, including failure resilience and flow migratio...
