A Modeling Framework for Reliability of Erasure Codes in SSD Arrays

12/23/2021
by   Mostafa Kishani, et al.

To improve the reliability of SSD arrays, Redundant Array of Independent Disks (RAID) is commonly employed. However, the conventional reliability models of HDD RAID cannot be applied to SSD arrays, as the nature of failures in SSDs is different from that in HDDs. Previous studies on the reliability of SSD arrays are based on outdated SSD failure data and focus on only a limited set of failure types (device failures and page failures caused by bit errors), while recent field studies have reported other failure types, including bad blocks and bad chips, and a high correlation between failures. In this paper, we explore the reliability of SSD arrays using field storage traces and a real-system implementation of conventional and emerging erasure codes. The reliability is evaluated by statistical fault injections that post-process the usage logs from the real-system implementation, while the fault/failure attributes are obtained from field data. As a case study, we examine conventional and emerging erasure codes in terms of both reliability and performance using Linux MD RAID and commercial SSDs. Our analysis shows that a) emerging erasure codes fail to replace RAID6 in terms of reliability, b) row-wise erasure codes are the most efficient choices for contemporary SSD devices, and c) previous models overestimate the SSD array reliability by up to six orders of magnitude, as they focus on the coincidence of bad pages and bad chips, which is the root of only a minority of Data Loss (DL) in SSD arrays. Our experiments show that the combination of bad chips with bad blocks is the major source of DL in RAID5 and emerging codes (contributing more than 54 codes, respectively), while RAID6 remains robust under these failure combinations. Finally, the fault injection results show that SSD array reliability, as well as the failure breakdown, is significantly correlated with SSD type.
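
The idea of statistical fault injection mentioned above can be illustrated with a minimal Monte Carlo sketch. It is not the paper's framework: the fault probabilities, the function names, and the assumption that faults hit chunks independently are all hypothetical (the abstract itself notes that real failures are highly correlated), and no usage logs or field data are involved. A stripe is counted as lost when more of its chunks are faulty than the code can tolerate (one for RAID5, two for RAID6).

```python
import random

# Hypothetical per-chunk fault probabilities (illustrative only, not field data).
FAULT_PROBS = {"bad_chip": 0.02, "bad_block": 0.03, "bad_page": 0.05}


def chunk_is_faulty(rng, probs):
    """Return True if at least one fault type hits this chunk."""
    return any(rng.random() < p for p in probs.values())


def estimate_data_loss(n_devices, tolerance, trials, probs=FAULT_PROBS, seed=0):
    """Estimate the fraction of stripes with more faulty chunks than tolerated.

    n_devices: chunks per stripe (one per device, an assumption of this sketch)
    tolerance: number of chunk failures the code survives (1 = RAID5, 2 = RAID6)
    """
    rng = random.Random(seed)
    losses = 0
    for _ in range(trials):
        # Inject faults independently into every chunk of one stripe.
        faulty = sum(chunk_is_faulty(rng, probs) for _ in range(n_devices))
        if faulty > tolerance:
            losses += 1
    return losses / trials


if __name__ == "__main__":
    print("RAID5-like:", estimate_data_loss(n_devices=8, tolerance=1, trials=100_000))
    print("RAID6-like:", estimate_data_loss(n_devices=8, tolerance=2, trials=100_000))
```

Because both runs use the same seed and the same number of devices, they see the same injected faults, so the difference between the two estimates comes only from the tolerance of the code.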


