RepTFD: Replay Based Transient Fault Detection

06/11/2012
by Lei Li, et al.
Advances in IC process technology make future chip multiprocessors (CMPs) increasingly vulnerable to transient faults. To detect transient faults, previous core-level schemes provide redundancy for each core separately. As a result, transient faults in the uncore parts, which consume over 50% of the area of a modern CMP, may escape detection. This paper proposes RepTFD, the first core-level transient fault detection scheme with 100% coverage. Instead of providing redundancy for each core separately, RepTFD provides redundancy for a group of cores as a whole. Specifically, it replays the execution of the checked group of cores on a redundant group of cores. By comparing the execution results of the two groups of cores, all malignant transient faults can be caught. Moreover, RepTFD adopts a novel pending-period-based record-replay approach, which greatly reduces the number of execution orders that need to be enforced in the replay run. As a result, RepTFD incurs only 4.76% performance overhead relative to normal execution without fault tolerance, according to our experiments on the RTL design of an industrial CMP named Godson-3. In addition, RepTFD consumes only about 0.83% of the area of Godson-3, while needing only trivial modifications to existing components of Godson-3.
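The group-level replay-and-compare idea can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the RepTFD hardware design: the names `run`, `digest`, `check`, `PROGRAMS`, and `SCHEDULE` are hypothetical, the "cores" are plain Python lists of operations on a shared dictionary, every recorded ordering is enforced during replay (the pending-period optimization is not modeled), and the transient fault is an injected single-bit flip.

```python
import hashlib

def run(programs, schedule, fault_step=None):
    """Execute per-core operation lists on shared memory in a given order.

    programs:   {core_id: [(addr, delta), ...]} (hypothetical toy workload)
    schedule:   list of core ids; each entry runs that core's next operation
    fault_step: if set, flip one bit of the value written at that step
    """
    memory = {}
    next_op = {core: 0 for core in programs}
    for step, core in enumerate(schedule):
        addr, delta = programs[core][next_op[core]]
        next_op[core] += 1
        # The update is order dependent, so the interleaving affects the result.
        value = (memory.get(addr, 0) * 2 + delta) & 0xFFFF
        if step == fault_step:
            value ^= 1 << 3  # injected transient fault (single-bit flip)
        memory[addr] = value
    return memory

def digest(memory):
    """Summarize the final memory state so two runs can be compared cheaply."""
    return hashlib.sha256(repr(sorted(memory.items())).encode()).hexdigest()

def check(programs, recorded_schedule, fault_step=None):
    """Return True if the checked run and its replay agree."""
    checked = run(programs, recorded_schedule, fault_step)  # may be faulty
    replayed = run(programs, recorded_schedule)             # redundant group
    return digest(checked) == digest(replayed)

# Two toy cores sharing addresses 0 and 1, plus one recorded interleaving.
PROGRAMS = {0: [(0, 1), (1, 2)], 1: [(0, 3), (1, 4)]}
SCHEDULE = [0, 1, 1, 0]

if __name__ == "__main__":
    print("fault-free run matches replay:", check(PROGRAMS, SCHEDULE))
    print("faulty run matches replay:   ", check(PROGRAMS, SCHEDULE, fault_step=1))
```

In this model the replay reproduces the checked run only because it follows the recorded interleaving; running the same operations in a different order yields a different final state, which is why some execution orders must be enforced during replay.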


