Failure Analysis and Quantification for Contemporary and Future Supercomputers

11/05/2019
by Li Tan, et al.

Large-scale computing systems today are assembled from numerous computing units to provide the massive computational capability needed to solve problems at scale, which makes failures common events in supercomputing scenarios. Considering the demanding resilience requirements of today's supercomputers, we present a quantitative study on fine-grained failure modeling for contemporary and future large-scale computing systems. We integrate various types of failures from different hierarchical levels and components of the system, and formally summarize the overall system failure rates. Given that the system-wide failure rate must now be capped under a threshold value for reliability and cost-efficiency purposes, we quantitatively discuss different scenarios of system resilience, and analyze how resilience to different error types affects the variation of system failure rates and the correlation of hierarchical failure rates. Moreover, we formalize and showcase the resilience efficiency of today's failure-bounded supercomputers.

research
11/05/2019

Soft Error Resilience and Failure Recovery for Continuum Dynamics Applications

The persistently growing resilience concerns of large-scale computing sy...
research
04/10/2020

A Resilient AWGR and Server Based PON Data Centre Architecture

This paper studies the resilience of an AWGR and server based PON DCN ar...
research
06/14/2017

Towards Adaptive Resilience in High Performance Computing

Failure rates in high performance computers rapidly increase due to the ...
research
07/21/2021

On ageing properties of lifetime distributions

A reasonable segment of reliability theory is perpetrated to the study o...
research
04/17/2018

Adaptive control in rollforward recovery for extreme scale multigrid

With the increasing number of compute components, failures in future exa...
research
03/18/2019

Quantifying dynamics of failure across science, startups, and security

Human achievements are often preceded by repeated attempts that initiall...
research
02/21/2018

Asymptotic efficiency of restart and checkpointing

Many tasks are subject to failure before completion. Two of the most com...
