HyCA: A Hybrid Computing Architecture for Fault Tolerant Deep Learning

06/09/2021
by Cheng Liu, et al.

Hardware faults in the regular 2-D computing array of a typical deep learning accelerator (DLA) can cause dramatic loss of prediction accuracy. To avoid excessive hardware overhead, prior redundancy designs typically assign each homogeneous redundant processing element (PE) to mitigate faulty PEs within a limited region of the 2-D computing array rather than across the entire array. However, they fail to recover the computing array when the number of faulty PEs in any region exceeds the number of redundant PEs in that region. This mismatch problem worsens as the fault injection rate rises and when the faults are unevenly distributed. To address the problem, we propose a hybrid computing architecture (HyCA) for fault-tolerant DLAs. It has a set of dot-production processing units (DPPUs) that recompute all the operations mapped to the faulty PEs, regardless of where the faulty PEs are located. According to our experiments, HyCA shows significantly higher reliability, scalability, and performance with less chip area penalty than conventional redundancy approaches. Moreover, by taking advantage of this flexible recomputing, HyCA can also scan the entire 2-D computing array and detect faulty PEs effectively at runtime.
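
The mismatch problem described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: it assumes, purely for illustration, that each "region" is one row of the array with a fixed number of spare PEs, and that the shared recomputing units can cover up to a hypothetical `capacity` of faulty PEs anywhere in the array. The function names and parameters are made up for this example.

```python
from collections import Counter

def row_redundancy_recovers(faulty, spares_per_row=1):
    """Region-based redundancy (assumed here: one region per row).

    Recovery succeeds only if no single row contains more faulty PEs
    than it has spare PEs.
    """
    per_row = Counter(row for row, _ in set(faulty))
    return all(count <= spares_per_row for count in per_row.values())

def global_recompute_recovers(faulty, capacity):
    """Shared recomputing pool (hypothetical capacity parameter).

    Only the total number of faulty PEs matters, not where they are.
    """
    return len(set(faulty)) <= capacity

# Two faults clustered in row 0 of a 4x4 array, as (row, column) pairs.
faults = [(0, 0), (0, 1)]
print(row_redundancy_recovers(faults, spares_per_row=1))  # False
print(global_recompute_recovers(faults, capacity=4))      # True
```

With one spare per row, the 4x4 array carries four spares in total, yet two faults in the same row already defeat it. A location-independent pool of the same nominal size tolerates the same fault pattern, which is the intuition behind recomputing on shared units.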


research
02/11/2018

Analyzing and Mitigating the Impact of Permanent Faults on a Systolic Array Based Neural Network Accelerator

Due to their growing popularity and computational cost, deep neural netw...
research
07/04/2018

A New Paradigm for Fault-Tolerant Computing with Interconnect Crosstalks

The CMOS integrated chips at advanced technology nodes are becoming more...
research
06/19/2020

Design of a Near-Ideal Fault-Tolerant Routing Algorithm for Network-on-Chip-Based Multicores

With relentless CMOS technology downsizing Networks-on-Chips (NoCs) are ...
research
05/21/2023

Reduce: A Framework for Reducing the Overheads of Fault-Aware Retraining

Fault-aware retraining has emerged as a prominent technique for mitigati...
research
04/05/2022

Fault-Tolerant Deep Learning: A Hierarchical Perspective

With the rapid advancements of deep learning in the past decade, it can ...
research
09/17/2023

An Auto-Parallelizer for Distributed Computing in Haskell

One of the main challenges in distributed computing is building interfac...
research
05/12/2014

Heterogeneity-aware Fault Tolerance using a Self-Organizing Runtime System

Due to the diversity and implicit redundancy in terms of processing unit...
