Towards Management of Energy Consumption in HPC Systems with Fault Tolerance

12/21/2020
by Marina Morán, et al.

High-performance computing continues to increase its computing power and energy efficiency. However, energy consumption continues to rise, and finding ways to limit or reduce it is a crucial goal of current research. For high-performance MPI applications, there are fault tolerance methods based on rollback recovery, such as uncoordinated checkpointing. With these methods, only some processes roll back when a failure occurs, while the remaining processes continue to run. In this article, we focus on the processes that continue execution and propose a series of strategies to manage energy consumption when a failure occurs and uncoordinated checkpoints are used. We present an energy model to evaluate the strategies and, through simulation, analyze the behavior of an application under different configurations and failure times. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure.
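
The abstract does not reproduce the paper's energy model, so the following is only an illustrative sketch of the kind of comparison it describes, not the authors' method. It assumes a surviving process is blocked for a known time while a failed process recovers, and that energy is simply power multiplied by time. The function name `waiting_energy`, its parameters, and the example figures are all hypothetical.

```python
def waiting_energy(wait_s, active_w, low_w, transition_s=0.0):
    """Return (busy_wait_joules, low_power_joules) for one blocked process.

    Hypothetical model, not taken from the paper:
      wait_s       -- seconds the surviving process waits for the recovery
      active_w     -- power draw (watts) while waiting at full power
      low_w        -- power draw (watts) in an assumed low-power state
      transition_s -- seconds spent entering/leaving the low-power state,
                      charged at full power
    """
    if wait_s < 0 or transition_s < 0:
        raise ValueError("times must be non-negative")

    # Strategy A: keep waiting at full power for the whole interval.
    busy = active_w * wait_s

    # Strategy B: switch to the low-power state; the transition overhead
    # (capped at the waiting time) is still paid at full power.
    overhead = min(transition_s, wait_s)
    reduced = active_w * overhead + low_w * (wait_s - overhead)

    return busy, reduced


if __name__ == "__main__":
    busy, reduced = waiting_energy(wait_s=100.0, active_w=200.0,
                                   low_w=50.0, transition_s=10.0)
    print(f"busy wait: {busy:.0f} J, low power: {reduced:.0f} J")
```

Under these assumptions, the low-power strategy only pays off when the waiting time is long relative to the transition overhead, which is why the failure time and the application configuration matter in such an evaluation.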


