Understanding the Effects of Permanent Faults in GPU's Parallelism Management and Control Units

Graphics Processing Units (GPUs) are heavily stressed when accelerating High-Performance Computing applications, and they also accelerate Deep Neural Networks in several domains where they are expected to operate for many years. These conditions expose the GPU hardware to (premature) aging, causing permanent faults to arise after the usual end-of-manufacturing test. Techniques to assess the impact of permanent faults in GPUs are therefore strongly needed, so that the reliability risk can be estimated and possibly mitigated. In this paper, we present a method to evaluate the effects of permanent faults affecting the GPU scheduler and control units, which are the most peculiar and stressed resources, along with the first figures that quantify these effects. We characterize over 5.83x10^5 permanent fault effects in the scheduler and controllers of a gate-level GPU model. Then, we map the observed error categories to software by instrumenting the code of 13 applications and two convolutional neural networks, injecting more than 1.65x10^5 permanent errors. Our two-level fault injection strategy reduces the evaluation time from hundreds of years of gate-level evaluation to hundreds of hours. We found that faults in the GPU parallelism management units can modify the opcode, the addresses, and the status of thread(s) and warp(s). The large majority (up to 99%) [...] Errors affecting the instruction operation or resource management hang the code, while 45% [...] silent data corruptions.
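To make the software-level half of the two-level strategy more concrete, the following is a minimal sketch of how a permanent error in thread status or thread addressing might be emulated and classified. It is not the authors' framework: the toy warp model, the names `run_warp` and `classify`, the 8-lane warp width, and the two error models are all assumptions chosen for illustration, and hangs are not modeled.

```python
from typing import List, Optional, Tuple

WARP_SIZE = 8  # toy warp width (assumption); real GPUs typically use 32 lanes


def run_warp(
    a: List[int],
    b: List[int],
    disabled_lane: Optional[int] = None,
    lane_id_fault: Optional[Tuple[int, int]] = None,
) -> List[int]:
    """Run out[i] = a[i] + b[i] on one toy warp, one lane per element.

    disabled_lane: lane whose active bit is permanently stuck at 0
                   (hypothetical model of a thread-status error).
    lane_id_fault: (lane, wrong_index) pair; the lane always computes with
                   wrong_index instead of its own index
                   (hypothetical model of an addressing error).
    """
    out = [0] * WARP_SIZE
    for lane in range(WARP_SIZE):
        if lane == disabled_lane:
            continue  # the faulty lane never executes
        idx = lane
        if lane_id_fault is not None and lane == lane_id_fault[0]:
            idx = lane_id_fault[1]  # the faulty lane works on the wrong element
        out[idx] = a[idx] + b[idx]
    return out


def classify(golden: List[int], observed: List[int]) -> str:
    """Compare against the fault-free run: 'masked' or 'sdc'."""
    return "masked" if golden == observed else "sdc"


if __name__ == "__main__":
    a = list(range(WARP_SIZE))
    b = [10] * WARP_SIZE
    golden = run_warp(a, b)
    # Each error is permanent: it is applied on every execution of the kernel.
    print(classify(golden, run_warp(a, b, disabled_lane=3)))
    print(classify(golden, run_warp(a, b, lane_id_fault=(3, 5))))
```

The same error can be masked or can corrupt the output depending on the input data, which is one reason a campaign of this kind has to inject across many applications rather than a single kernel.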


research
05/24/2022

Reliability Assessment of Neural Networks in GPUs: A Framework For Permanent Faults Injections

Currently, Deep learning and especially Convolutional Neural Networks (C...
research
03/02/2021

Representing Gate-Level SET Faults by Multiple SEU Faults at RTL

The advanced complex electronic systems increasingly demand safer and mo...
research
08/30/2023

On-Chip Sensors Data Collection and Analysis for SoC Health Management

Data produced by on-chip sensors in modern SoCs contains a large amount ...
research
12/07/2021

Lightning: Striking the Secure Isolation on GPU Clouds with Transient Hardware Faults

GPU clouds have become a popular computing platform because of the cost ...
research
10/03/2009

Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU

Graphics processing units (GPUs) are gaining widespread use in computati...
research
03/14/2023

ISimDL: Importance Sampling-Driven Acceleration of Fault Injection Simulations for Evaluating the Robustness of Deep Learning

Deep Learning (DL) systems have proliferated in many applications, requi...
research
02/21/2019

GPU Acceleration of Real-Time Control Loops

Extreme Ultraviolet (EUV) photolithography is seen as the key enabler fo...
