RAID Organizations for Improved Reliability and Performance: A Not Entirely Unbiased Tutorial

06/14/2023

This is a follow-up to the 1994 tutorial by the Berkeley RAID researchers, whose 1988 RAID paper foresaw a revolutionary change in the storage industry based on advances in magnetic disk technology, i.e., the replacement of large-capacity expensive disks with arrays of small-capacity inexpensive disks. NAND flash SSDs, which use less power, incur very low latency, provide high bandwidth, and are more reliable than HDDs, are expected to replace HDDs as their prices drop. Replication in the form of mirrored disks and erasure coding via parity and Reed-Solomon codes are two methods to achieve higher reliability through redundancy in disk arrays. RAID(4+k) arrays, k=1,2,..., utilize k check strips, which makes them k-disk-failure-tolerant; the coding is maximum distance separable (MDS), i.e., it attains this tolerance with minimum redundancy. Clustered RAID, local recovery codes, partial MDS, and multilevel RAID are proposals to improve RAID reliability and performance. We discuss RAID5 performance and reliability analysis in conjunction with HDDs without and with latent sector errors (LSEs), which can be dealt with by intradisk redundancy and disk scrubbing, the latter enhanced with machine learning algorithms. Undetected disk errors causing silent data corruption are propagated by rebuild. We utilize the M/G/1 queueing model for RAID5 performance evaluation, and present approximations for fork/join response time in degraded mode analysis and the vacationing server model for rebuild analysis. Methods and tools for reliability evaluation with Markov chain modeling and simulation are discussed. Queueing and reliability analysis are both based on probability theory and stochastic processes, so the two topics can be studied together. Their application is presented here in the context of RAID arrays in a tutorial manner.
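
As a small illustration of erasure coding via parity, the following is a minimal sketch, not taken from the tutorial itself, of how a single check strip (the k=1 case) lets one lost strip be recovered by XORing the survivors. The function names xor_strips, make_stripe, and rebuild_strip are hypothetical, and strips are modeled as equal-length byte strings rather than disk blocks.

```python
from functools import reduce


def xor_strips(strips):
    """Return the bytewise XOR of a list of equal-length byte strings."""
    if not strips:
        raise ValueError("need at least one strip")
    length = len(strips[0])
    if any(len(s) != length for s in strips):
        raise ValueError("strips must have equal length")
    # zip(*strips) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


def make_stripe(data_strips):
    """Append one parity strip to the data strips (hypothetical layout:
    parity is stored last; real RAID5 rotates its position across stripes)."""
    return list(data_strips) + [xor_strips(data_strips)]


def rebuild_strip(stripe, failed_index):
    """Reconstruct the strip at failed_index from all surviving strips."""
    survivors = [s for i, s in enumerate(stripe) if i != failed_index]
    return xor_strips(survivors)


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl", b"mnop"]
    stripe = make_stripe(data)
    # Any single strip, data or parity, can be rebuilt from the other four.
    for i in range(len(stripe)):
        assert rebuild_strip(stripe, i) == stripe[i]
    print("all single-strip failures recovered")
```

Rebuilding a strip reads every surviving strip in the stripe, which is why degraded-mode reads and rebuild place extra load on the remaining disks. Tolerating k>1 failures requires additional check strips computed with a code such as Reed-Solomon, which this sketch does not cover.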
