We initiate the study of parallel algorithms for fairly allocating
MANA-2.0 is a scalable, future-proof design for transparent checkpointin...
Checkpoint/restart (C/R) provides fault-tolerant computing capability,
Recently, the predicate detection problem was shown to be in the paralle...
Transparently checkpointing MPI for fault tolerance and load balancing i...
Unified Virtual Memory (UVM) was introduced on recent NVIDIA GP...
Checkpoint-restart is now a mature technology. It allows a user to save ...
Fault tolerance for the upcoming exascale generation has long been an ar...
Providing fault-tolerance for long-running GPU-intensive jobs requires
It is common today to deploy complex software inside a virtual machine (...