Enabling Software Resilience in GPGPU Applications via Partial Thread Protection

03/04/2021
by Lishan Yang, et al.

Graphics Processing Units (GPUs) are widely used to accelerate computation across a broad variety of fields, but they remain susceptible to transient hardware faults (soft errors) that can easily compromise application output. By taking advantage of the hierarchical organization of general-purpose GPU (GPGPU) applications into threads, warps, and cooperative thread arrays, we propose a methodology that identifies the resilience of threads and aims to map threads with the same resilience characteristics to the same warp. This allows partial replication mechanisms for error detection/correction to be engaged at the warp level. By exploring 12 benchmarks (17 kernels) from 4 benchmark suites, we illustrate that threads can be remapped into reliable or unreliable warps with only 1.63 protection via replication to those groups of threads that truly need it. Furthermore, we show that remapping threads to different warps does not sacrifice application performance. We show how this remapping facilitates warp replication for error detection and/or correction and achieves an average reduction of 20.61 standard duplication/triplication.
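The grouping idea in the abstract can be illustrated with a small host-side sketch. This is not the authors' implementation: the per-thread labels, the function names `remap_threads` and `warps_needing_protection`, and the "pack vulnerable threads first" policy are all assumptions made for illustration. It only shows why placing threads with the same resilience class in the same warp reduces the number of warps that would need replication.

```python
WARP_SIZE = 32  # threads per warp on current NVIDIA GPUs


def remap_threads(is_vulnerable):
    """Hypothetical remapping: return `order`, where order[new_slot] is the
    original thread id, with all vulnerable threads packed into the leading
    slots so that they share as few warps as possible."""
    vulnerable = [t for t, v in enumerate(is_vulnerable) if v]
    resilient = [t for t, v in enumerate(is_vulnerable) if not v]
    return vulnerable + resilient


def warps_needing_protection(is_vulnerable, order, warp_size=WARP_SIZE):
    """Count warps that contain at least one vulnerable thread and would
    therefore be replicated under warp-level protection."""
    count = 0
    for start in range(0, len(order), warp_size):
        warp = order[start:start + warp_size]
        if any(is_vulnerable[t] for t in warp):
            count += 1
    return count


# Assumed example: 128 threads, every 4th thread labelled vulnerable.
labels = [t % 4 == 0 for t in range(128)]
identity = list(range(128))

print(warps_needing_protection(labels, identity))               # 4
print(warps_needing_protection(labels, remap_threads(labels)))  # 1
```

With the original thread order, every warp contains a vulnerable thread, so all four warps would have to be replicated. After grouping, the 32 vulnerable threads fill a single warp. When the number of vulnerable threads is not a multiple of the warp size, one mixed warp remains at the boundary.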


