The RECIPE Approach to Challenges in Deeply Heterogeneous High Performance Systems

03/04/2021
by Giovanni Agosta, et al.

RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) technologies. RECIPE aims to introduce a hierarchical runtime resource management infrastructure that optimizes energy efficiency and minimizes the occurrence of thermal hotspots, while enforcing the time constraints imposed by the applications and ensuring reliability for both time-critical and throughput-oriented computations that run on deeply heterogeneous accelerator-based systems. This paper presents a detailed overview of RECIPE, identifying the fundamental challenges as well as the key innovations addressed by the project. In particular, the need for predictive reliability approaches that maximize hardware lifetime and guarantee application performance is identified as the key concern for RECIPE. It is addressed via hierarchical resource management of the heterogeneous architectural components of the system, driven by estimates of application latency and hardware reliability. These estimates are obtained, respectively, through timing analysis and through modelling of the thermal properties and mean time to failure of subsystems. We show the impact of prediction accuracy on the overheads imposed by the checkpointing policy, as well as a possible application to a weather forecasting use case.
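To give a feel for why prediction accuracy matters for checkpointing overhead, the sketch below uses a standard first-order model rather than the paper's own policy, which the abstract does not specify. It assumes Young's approximation for the checkpoint interval, T = sqrt(2 * C * MTTF), and a first-order overhead of C/T + T/(2 * MTTF). The function names and the numeric values (a 60 s checkpoint cost, a 24 h mean time to failure) are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mttf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_s * mttf_s)


def expected_overhead(interval_s: float, checkpoint_cost_s: float,
                      true_mttf_s: float) -> float:
    """First-order fraction of time lost.

    The first term is the time spent writing checkpoints; the second is the
    expected rework after a failure (on average half an interval is lost).
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * true_mttf_s)


def overhead_with_predicted_mttf(checkpoint_cost_s: float, true_mttf_s: float,
                                 predicted_mttf_s: float) -> float:
    """Overhead when the interval is tuned to a predicted (possibly wrong) MTTF."""
    interval = young_interval(checkpoint_cost_s, predicted_mttf_s)
    return expected_overhead(interval, checkpoint_cost_s, true_mttf_s)


if __name__ == "__main__":
    cost = 60.0            # assumed checkpoint cost in seconds
    true_mttf = 86400.0    # assumed true MTTF: 24 hours
    best = overhead_with_predicted_mttf(cost, true_mttf, true_mttf)
    for factor in (0.25, 0.5, 1.0, 2.0, 4.0):
        got = overhead_with_predicted_mttf(cost, true_mttf, factor * true_mttf)
        print(f"predicted/true MTTF = {factor:4.2f}: "
              f"overhead = {got:.4f} ({got / best:.2f}x optimum)")
```

Under this simplified model the overhead is lowest when the predicted MTTF equals the true one, and it grows whether the prediction is too optimistic (checkpoints too sparse, more rework) or too pessimistic (checkpoints too frequent).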


