Tolerating Soft Errors in Processor Cores Using CLEAR (Cross-Layer Exploration for Architecting Resilience)

09/28/2017
by Eric Cheng, et al.

We present CLEAR (Cross-Layer Exploration for Architecting Resilience), a first-of-its-kind framework that overcomes a major challenge in the design of digital systems that are resilient to reliability failures: achieving desired resilience targets at minimal cost (energy, power, execution time, area) by combining resilience techniques across various layers of the system stack (circuit, logic, architecture, software, algorithm). This approach is also referred to as cross-layer resilience. In this paper, we focus on radiation-induced soft errors in processor cores. We address both single-event upsets (SEUs) and single-event multiple upsets (SEMUs) in terrestrial environments. Our framework automatically and systematically explores the large space of comprehensive resilience techniques and their combinations across various layers of the system stack (586 cross-layer combinations in this paper), derives cost-effective solutions that achieve resilience targets at minimal cost, and provides guidelines for the design of new resilience techniques. Our results demonstrate that a carefully optimized combination of circuit-level hardening, logic-level parity checking, and micro-architectural recovery provides a highly cost-effective soft error resilience solution for general-purpose processor cores. For example, a 50× improvement in silent data corruption rate is achieved at only 2.1% energy cost for an out-of-order core (6.1% for an in-order core) with no speed impact. However, (application-aware) selective circuit-level hardening alone, guided by a thorough analysis of the effects of soft errors on application benchmarks, provides a cost-effective soft error resilience solution as well (with ~1% additional energy cost for a 50× improvement in silent data corruption rate).
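The exploration described in the abstract can be pictured as a search over combinations of techniques for the cheapest one that meets a resilience target. The following is only a toy sketch of that idea, not the CLEAR framework itself: the technique names, their energy costs, their improvement factors, and the simplifying assumption that costs add while improvement factors multiply are all hypothetical placeholders, not results from the paper.

```python
from itertools import combinations

# HYPOTHETICAL techniques, for illustration only (not data from the paper):
# (name, layer, energy cost in percent, SDC improvement factor)
TECHNIQUES = [
    ("hardened_flip_flops", "circuit", 2.0, 10.0),
    ("parity_checking", "logic", 1.5, 6.0),
    ("pipeline_flush_recovery", "architecture", 0.5, 1.0),
    ("software_assertions", "software", 8.0, 2.0),
]


def evaluate(combo):
    """Return (total cost, total improvement) for a combination.

    Assumption of this sketch: costs add and improvement factors multiply.
    A real framework would measure both per combination instead.
    """
    cost = sum(t[2] for t in combo)
    improvement = 1.0
    for t in combo:
        improvement *= t[3]
    return cost, improvement


def cheapest_combination(techniques, target):
    """Exhaustively search all non-empty combinations and return the
    lowest-cost one whose improvement meets the target, or None."""
    best = None
    for r in range(1, len(techniques) + 1):
        for combo in combinations(techniques, r):
            cost, improvement = evaluate(combo)
            if improvement >= target and (best is None or cost < best[0]):
                best = (cost, improvement, [t[0] for t in combo])
    return best


if __name__ == "__main__":
    print(cheapest_combination(TECHNIQUES, target=50.0))
```

Exhaustive enumeration is workable here because the space is small (the abstract mentions 586 cross-layer combinations); the placeholder model above stands in for whatever per-combination evaluation a real flow would perform.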

research
02/18/2022

Lightweight Soft Error Resilience for In-Order Cores

Acoustic-sensor-based soft error resilience is particularly promising, s...
research
02/22/2018

Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing

Resiliency is the ability of large-scale high-performance computing (HPC...
research
01/13/2020

SERAD: Soft Error Resilient Asynchronous Design using a Bundled Data Protocol

The risk of soft errors due to radiation continues to be a significant c...
research
11/19/2018

Architectural-Space Exploration of Heterogeneous Reliability and Checkpointing Modes for Out-of-Order Superscalar Processors

Reliability has emerged as a key topic of interest for researchers aroun...
research
03/04/2021

Enabling Software Resilience in GPGPU Applications via Partial Thread Protection

Graphics Processing Units (GPUs) are widely used by various applications...
research
04/06/2021

Towards Soft Circuit Breaking in Service Meshes via Application-agnostic Caching

Service meshes factor out code dealing with inter-micro-service communic...
