NVMe and PCIe SSD Monitoring in Hyperscale Data Centers
With low latency, high throughput, and enterprise-grade reliability, SSDs have become the de facto choice for storage in the data center. As a result, SSDs are used in all online data stores at LinkedIn. These applications persist and serve critical user data with millisecond latencies. For the hosts serving these applications, SSD faults are the single largest cause of failure. Frequent SSD failures result in significant downtime for critical applications and generate a significant downstream root cause analysis (RCA) load for systems operations teams. A lack of insight into the runtime characteristics of these drives limits the ability to provide accurate RCAs and hinders credible, long-term fixes. In this paper, we describe the system developed at LinkedIn to facilitate real-time monitoring of SSDs and the insights we gained into their failure characteristics. We describe how we used those insights to perform predictive maintenance and present the resulting reduction in man-hours spent on maintenance.