Evaluating and Enhancing Robustness of Deep Recommendation Systems Against Hardware Errors

07/17/2023
by Dongning Ma, et al.

Deep recommendation systems (DRS) depend heavily on specialized HPC hardware and accelerators to optimize energy, efficiency, and recommendation quality. Despite the growing number of hardware errors observed in large-scale fleet systems where DRS are deployed, the robustness of DRS has been largely overlooked. This paper presents the first systematic study of DRS robustness against hardware errors. We develop Terrorch, a user-friendly, efficient, and flexible error injection framework built on top of the widely used PyTorch. We evaluate a wide range of models and datasets and observe that DRS robustness against hardware errors is influenced by various factors, from model parameters to input characteristics. We also explore three error mitigation methods: algorithm-based fault tolerance (ABFT), activation clipping, and selective bit protection (SBP). We find that applying activation clipping can recover up to 30 mitigation method.
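
To make the ideas of error injection and activation clipping concrete, the following is a minimal sketch using only the Python standard library. It is not Terrorch's API and not the authors' implementation: the function names `flip_bit` and `clip_activation` are hypothetical, the single-bit-flip error model is an assumption, and the clipping range of -6.0 to 6.0 is an arbitrary value chosen for illustration.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a value's float32 encoding (0 = least significant, 31 = sign).

    Hypothetical helper: models a single-bit hardware error in a stored parameter.
    """
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in [0, 31]")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-hot mask flips exactly the requested bit.
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


def clip_activation(x: float, low: float, high: float) -> float:
    """Clamp an activation into [low, high].

    Hypothetical helper: the bounds are assumed to come from error-free profiling.
    """
    return max(low, min(high, x))


weight = 0.5
# Flipping the most significant exponent bit turns a small value into a huge one.
corrupted = flip_bit(weight, 30)
# Clipping bounds the damage before it propagates (range is an assumption).
clipped = clip_activation(corrupted, -6.0, 6.0)
print(weight, corrupted, clipped)
```

The sketch suggests why exponent bits matter more than others: one flip in a high exponent bit changes the magnitude by many orders, while clipping caps how far a corrupted value can move the output.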


