NVMe and PCIe SSD Monitoring in Hyperscale Data Centers

03/25/2020
by Nikhil Khatri, et al.

With low latency, high throughput, and enterprise-grade reliability, SSDs have become the de facto choice for storage in the data center. As a result, SSDs are used in all online data stores at LinkedIn. These applications persist and serve critical user data and have millisecond latencies. For the hosts serving these applications, SSD faults are the single largest cause of failure. Frequent SSD failures result in significant downtime for critical applications. They also generate a significant downstream RCA (root cause analysis) load for systems operations teams. A lack of insight into the runtime characteristics of these drives limits the ability to provide accurate RCAs and hinders credible, long-term fixes. In this paper, we describe the system developed at LinkedIn to facilitate real-time monitoring of SSDs and the insights we gained into failure characteristics. We describe how we used those insights to perform predictive maintenance and present the resulting reduction in man-hours spent on maintenance.

