Robust online identification of thermal models for in-production HPC clusters with machine learning-based data selection

10/03/2018
by Federico Pittino, et al.

Power and thermal management are critical components of high performance computing (HPC) systems because of their high power density and large total power consumption. Deriving compact thermal models directly from the thermal response of the final device enables more robust and precise thermal control strategies, as well as automated diagnosis. However, for large scale systems "in production", the accuracy of learned thermal models depends on the dynamics of the power excitation, which in turn depends on the executed workload, and on measurement nonidealities such as quantization. In this paper we show that, using an advanced system identification algorithm, we are able to generate very accurate thermal models (average error lower than our sensors' quantization step of 1 °C) for a large scale HPC system on real workloads. However, we also show that: 1) not all real workloads allow for the identification of a good model; 2) starting from the theory of system identification, it is very difficult to evaluate whether a trace of data leads to a good estimated model. We then propose and validate a set of techniques based on machine learning and deep learning algorithms for choosing the data traces to be used for model identification. We also show that only via deep learning techniques these traces can be correctly chosen up to 96
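To make the identification problem concrete, here is a minimal sketch in Python with NumPy. It is not the authors' algorithm: it assumes a hypothetical first-order discrete-time thermal model, T[k+1] = a·T[k] + b·P[k] + c, fits it by ordinary least squares, and uses synthetic traces quantized to 1 °C. All function names, parameter values, and traces are illustrative assumptions. The sketch contrasts a workload with varying power against one with constant power, where the regressors become collinear and the parameters cannot be separated.

```python
import numpy as np

def simulate_trace(power, a=0.95, b=0.05, t_ambient=25.0, quant_step=1.0):
    """Simulate an assumed first-order model and quantize the readings.

    T[k+1] = a*T[k] + b*P[k] + (1 - a)*t_ambient  (hypothetical parameters)
    """
    temp = np.empty(len(power) + 1)
    temp[0] = t_ambient
    for k, p in enumerate(power):
        temp[k + 1] = a * temp[k] + b * p + (1.0 - a) * t_ambient
    # Mimic a sensor with a fixed quantization step (1 degree by default).
    return np.round(temp / quant_step) * quant_step

def regression_matrix(power, temp):
    """Regressors [T[k], P[k], 1] for the one-step-ahead prediction."""
    return np.column_stack([temp[:-1], power, np.ones(len(power))])

def fit_model(power, temp):
    """Ordinary least-squares estimate of (a, b, c)."""
    theta, *_ = np.linalg.lstsq(regression_matrix(power, temp), temp[1:], rcond=None)
    return theta

def one_step_mae(power, temp, theta):
    """Mean absolute one-step-ahead prediction error on a trace."""
    return np.mean(np.abs(regression_matrix(power, temp) @ theta - temp[1:]))

rng = np.random.default_rng(0)

# "Rich" trace: power level changes every 50 samples (synthetic workload).
power_rich = np.repeat(rng.uniform(50.0, 150.0, size=40), 50)
temp_rich = simulate_trace(power_rich)
theta_rich = fit_model(power_rich, temp_rich)
mae_rich = one_step_mae(power_rich, temp_rich, theta_rich)

# "Poor" trace: constant power, so the power column is a multiple of the
# constant column and b cannot be distinguished from c.
power_poor = np.full(2000, 100.0)
temp_poor = simulate_trace(power_poor)

cond_rich = np.linalg.cond(regression_matrix(power_rich, temp_rich))
cond_poor = np.linalg.cond(regression_matrix(power_poor, temp_poor))

print("estimated (a, b, c) on rich trace:", theta_rich)
print("one-step MAE on rich trace:", mae_rich)
print("condition numbers (rich, poor):", cond_rich, cond_poor)
```

The condition number used here is only a crude excitation check for an extreme case. The abstract's point is that such theory-based criteria are hard to apply reliably to real production traces, which is what motivates the learned data selection.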

