Large Scale Studies of Memory, Storage, and Network Failures in a Modern Data Center

01/01/2019
by Justin Meza, et al.

The workloads running in the modern data centers of large-scale Internet service providers (such as Amazon, Baidu, Facebook, Google, and Microsoft) support billions of users and span globally distributed infrastructure. Yet the devices used in modern data centers fail due to a variety of causes, from faulty components to bugs to misconfiguration. Faulty devices make operating large-scale data centers challenging: the workloads consist of interdependent programs distributed across many servers, so a failure isolated to a single device can still have a widespread effect on a workload. In this dissertation, we measure and model the device failures in a large-scale Internet service company, Facebook. We focus on three device types that form the foundation of Internet service data center infrastructure: DRAM for main memory, SSDs for persistent storage, and switches and backbone links for network connectivity. For each of these device types, we analyze long-term device failure data broken down by important device attributes and operating conditions, such as age, vendor, and workload. We also build and release statistical models to examine the failure trends for the devices we analyze. Our key conclusion in this dissertation is that measurement and modeling can give us a deep understanding of why devices fail and how to predict their failure. We hope that the analysis, techniques, and models we present in this dissertation will enable the community to better measure, understand, and prepare for the hardware reliability challenges we face in the future.


