
Reduce: A Framework for Reducing the Overheads of Fault-Aware Retraining

by Muhammad Abdullah Hanif, et al.
New York University (NYU)

Fault-aware retraining has emerged as a prominent technique for mitigating permanent faults in Deep Neural Network (DNN) hardware accelerators. However, retraining incurs substantial overheads, particularly when used for fine-tuning large DNNs designed to solve complex problems. Moreover, since each fabricated chip can have a distinct fault pattern, fault-aware retraining must be performed individually for each chip, considering its unique fault map, which further aggravates the problem. To reduce the overall retraining cost, in this work we introduce the concept of resilience-driven retraining amount selection. To realize this concept, we propose a novel framework, Reduce, which first characterizes the resilience of the given DNN to faults at different fault rates and with different amounts of retraining. Then, based on this resilience profile, it computes the amount of retraining required for each chip given its unique fault map. We demonstrate the effectiveness of our methodology for a systolic array-based DNN accelerator experiencing permanent faults in its computational array.
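The two-step flow described above can be sketched in code. The snippet below is a hypothetical illustration, not the authors' implementation: the resilience profile (accuracy as a function of fault rate and retraining epochs) is invented for demonstration, and the function and variable names (`RESILIENCE_PROFILE`, `chip_fault_rate`, `select_retraining_amount`) are assumptions. In the framework, such a profile would be characterized once per DNN, then reused for every fabricated chip.

```python
# Hypothetical resilience profile, computed once per DNN:
# (fault_rate, retraining_epochs) -> post-retraining accuracy.
# All numbers here are invented for illustration.
RESILIENCE_PROFILE = {
    (0.01, 0): 0.88, (0.01, 1): 0.92, (0.01, 2): 0.93,
    (0.05, 0): 0.70, (0.05, 1): 0.85, (0.05, 2): 0.91,
    (0.10, 0): 0.50, (0.10, 1): 0.78, (0.10, 2): 0.89,
}

def chip_fault_rate(fault_map):
    """Fraction of faulty processing elements (PEs) in a chip's fault map
    (1 = permanently faulty PE, 0 = healthy PE)."""
    faulty = sum(row.count(1) for row in fault_map)
    total = sum(len(row) for row in fault_map)
    return faulty / total

def select_retraining_amount(fault_rate, target_accuracy,
                             profile=RESILIENCE_PROFILE):
    """Per-chip step: pick the smallest number of retraining epochs whose
    profiled accuracy, at the closest characterized fault rate, meets the
    target. Falls back to the maximum profiled amount if none suffices."""
    rates = sorted({r for r, _ in profile})
    nearest = min(rates, key=lambda r: abs(r - fault_rate))
    for epochs in sorted(e for r, e in profile if r == nearest):
        if profile[(nearest, epochs)] >= target_accuracy:
            return epochs
    return max(e for r, e in profile if r == nearest)

# Example: a 4x4 systolic array with 2 faulty PEs (fault rate 0.125).
fault_map = [[0, 0, 0, 1],
             [0, 0, 0, 0],
             [0, 1, 0, 0],
             [0, 0, 0, 0]]
rate = chip_fault_rate(fault_map)
epochs = select_retraining_amount(rate, target_accuracy=0.85)
```

Under this sketch, a chip with a mild fault pattern is assigned little or no retraining, while a heavily faulted chip receives more, so the fleet-wide retraining cost tracks each chip's actual need rather than a worst-case budget.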
