The RECIPE Approach to Challenges in Deeply Heterogeneous High Performance Systems

by Giovanni Agosta, et al.

RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) technologies. RECIPE aims to introduce a hierarchical runtime resource management infrastructure that optimizes energy efficiency and minimizes the occurrence of thermal hotspots, while enforcing the time constraints imposed by the applications and ensuring reliability for both time-critical and throughput-oriented computations that run on deeply heterogeneous accelerator-based systems. This paper presents a detailed overview of RECIPE, identifying the fundamental challenges as well as the key innovations addressed by the project. In particular, the need for predictive reliability approaches to maximize hardware lifetime and guarantee application performance is identified as the key concern for RECIPE. It is addressed via hierarchical resource management of the heterogeneous architectural components of the system, driven by estimates of application latency and hardware reliability obtained, respectively, through timing analysis and through modelling of the thermal properties and mean-time-to-failure of subsystems. We show the impact of prediction accuracy on the overheads imposed by the checkpointing policy, as well as a possible application to a weather forecasting use case.




The ANTAREX Domain Specific Language for High Performance Computing

The ANTAREX project relies on a Domain Specific Language (DSL) based on ...

Energy and Thermal-aware Resource Management of Cloud Data Centres: A Taxonomy and Future Directions

This paper investigates the existing resource management approaches in C...

Dwarf in a Giant: Enabling Scalable, High-Resolution HPC Energy Monitoring for Real-Time Profiling and Analytics

Energy efficiency, predictive maintenance and security are today key cha...

FFT, FMM, and Multigrid on the Road to Exascale: performance challenges and opportunities

FFT, FMM, and multigrid methods are widely used fast and highly scalable...

Energy Efficiency Aspects of the AMD Zen 2 Architecture

In High Performance Computing, systems are evaluated based on their comp...

A Data-Driven Approach to Lightweight DVFS-Aware Counter-Based Power Modeling for Heterogeneous Platforms

Computing systems have shifted towards highly parallel and heterogeneous...

Evaluating the predicted reliability of mechatronic systems: state of the art

Reliability analysis of mechatronic systems is a recent field and a dyna...
