Silent Data Corruptions at Scale

02/22/2021
by Harish Dattatraya Dixit, et al.

Silent Data Corruption (SDC) can have a negative impact on large-scale infrastructure services. SDCs are not captured by error reporting mechanisms within a Central Processing Unit (CPU) and hence are not traceable at the hardware level. However, the data corruptions propagate across the stack and manifest as application-level problems. These types of errors can result in data loss and can require months of debug engineering time. In this paper, we describe common defect types observed in silicon manufacturing that lead to SDCs. We discuss a real-world example of silent data corruption within a datacenter application. Using a case study, we provide the debug flow followed to root-cause and triage faulty instructions within a CPU, as an illustration of how to debug this class of errors. We provide a high-level overview of the mitigations that reduce the risk of silent data corruptions within a large production fleet. In our large-scale infrastructure, we have run a vast library of silent error test scenarios across hundreds of thousands of machines in our fleet. This has resulted in the detection of hundreds of CPUs exhibiting these errors, showing that SDCs are a systemic issue across generations. We have monitored SDCs for a period longer than 18 months. Based on this experience, we determine that reducing silent data corruptions requires not only hardware resiliency and production detection mechanisms, but also robust fault-tolerant software architectures.
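
To make the idea of a "silent error test scenario" concrete, here is a minimal Python sketch. It is not the authors' tooling: the function names, the integer-power workload, and the golden table are all assumptions chosen for illustration. The idea is to run a deterministic computation and compare each result against a value recorded on a machine believed to be healthy, because the hardware itself reports no error.

```python
from typing import Callable, Iterable, List, Tuple


def find_mismatches(
    compute: Callable[[int], int],
    golden: Iterable[Tuple[int, int]],
    repeats: int = 3,
) -> List[Tuple[int, int, int]]:
    """Return (input, expected, actual) for each input whose result
    differs from its golden value in any of the repeated runs."""
    mismatches = []
    for value, expected in golden:
        for _ in range(repeats):
            actual = compute(value)
            if actual != expected:
                # Record the first wrong answer for this input and move on.
                mismatches.append((value, expected, actual))
                break
    return mismatches


def int_power(exponent: int) -> int:
    """Hypothetical workload: a deterministic integer computation."""
    return 3 ** exponent


# Hypothetical golden table: (input, expected result) pairs that would be
# recorded on a machine believed to be healthy.
GOLDEN = [(0, 1), (1, 3), (5, 243), (10, 59049)]

if __name__ == "__main__":
    bad = find_mismatches(int_power, GOLDEN)
    print("suspect results:", bad)
```

On a healthy machine the list of suspect results is empty. A non-empty list only flags a machine for further triage; it does not by itself identify the faulty instruction or core.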

research
03/16/2022

Detecting silent data corruptions in the wild

Silent Errors within hardware devices occur when an internal defect mani...
research
11/01/2019

Fast Dimensional Analysis for Root Cause Investigation in a Large-Scale Service Environment

Root cause analysis in a large-scale production environment is challengi...
research
11/01/2019

Fast Dimensional Analysis for Root Cause Investigation in Large-Scale Service Environment

Root cause analysis in a large-scale production environment is challengi...
research
01/15/2022

Large-Scale Inventory Optimization: A Recurrent-Neural-Networks-Inspired Simulation Approach

Many large-scale production networks include thousands types of final pr...
research
07/26/2020

Approaches of large-scale images recognition with more than 50,000 categoris

Though current CV models have been able to achieve high levels of accura...
research
06/23/2020

Lumos: A Library for Diagnosing Metric Regressions in Web-Scale Applications

Web-scale applications can ship code on a daily to weekly cadence. These...
