Exploring Fault-Tolerant Erasure Codes for Scalable All-Flash Array Clusters

06/12/2019
by   Sungjoon Koh, et al.

Large-scale systems with all-flash arrays have become increasingly common in many computing segments. To make such systems resilient, we can adopt erasure coding, such as Reed-Solomon (RS) codes, as an alternative to replication, because erasure coding incurs a significantly lower storage overhead than replication. To understand the impact of erasure coding on system performance and on other system aspects such as CPU utilization and network traffic, we build a storage cluster that consists of approximately 100 processor cores and more than 50 high-performance solid-state drives (SSDs), and evaluate the cluster with Ceph, a popular open-source distributed parallel file system. Specifically, we analyze the behavior of a system adopting erasure coding from the following five viewpoints, and compare it with that of another system using replication: (1) storage system I/O performance; (2) computing and software overheads; (3) I/O amplification; (4) network traffic among storage nodes; and (5) the impact of physical data layout on the performance of RS-coded SSD arrays. For all these analyses, we examine two representative RS configurations, used by Google file systems, and compare them with triple replication, which a typical parallel file system employs as its default fault tolerance mechanism. Lastly, we collect 96 block-level traces from the cluster and release them to the public domain for use by other researchers.
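
The storage-overhead argument in the abstract can be illustrated with a little arithmetic. The sketch below assumes the usual RS(k, m) notation, in which k data chunks are encoded together with m coding chunks, so the raw capacity consumed per byte of user data is (k + m) / k. The abstract does not name the two RS configurations it evaluates, so the RS(6, 3) and RS(10, 4) parameters here are illustrative assumptions only, and the function names are hypothetical.

```python
from fractions import Fraction


def rs_storage_overhead(k: int, m: int) -> Fraction:
    """Raw bytes stored per byte of user data for an RS(k, m) code.

    k is the number of data chunks and m the number of coding chunks.
    """
    if k <= 0 or m < 0:
        raise ValueError("k must be positive and m must be non-negative")
    return Fraction(k + m, k)


def replication_storage_overhead(copies: int) -> Fraction:
    """Raw bytes stored per byte of user data for n-way replication."""
    if copies <= 0:
        raise ValueError("copies must be positive")
    return Fraction(copies, 1)


# Assumed example schemes: (label, overhead, simultaneous losses tolerated).
# The RS parameters are illustrative and are not taken from the abstract.
schemes = [
    ("triple replication", replication_storage_overhead(3), 3 - 1),
    ("RS(6, 3)", rs_storage_overhead(6, 3), 3),
    ("RS(10, 4)", rs_storage_overhead(10, 4), 4),
]

for label, overhead, tolerated in schemes:
    print(f"{label}: {float(overhead):.2f}x raw capacity, "
          f"tolerates {tolerated} lost chunks or copies")
```

Under these assumed parameters, both RS schemes consume less than half the raw capacity of triple replication while tolerating at least as many simultaneous losses. This is the trade-off that motivates measuring what erasure coding costs in I/O performance, CPU utilization, and network traffic.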

research
09/14/2017

Understanding System Characteristics of Online Erasure Coding on Scalable, Distributed and Large-Scale SSD Array Systems

Large-scale systems with arrays of solid state disks (SSDs) have become ...
research
06/20/2022

Building Blocks for Network-Accelerated Distributed File Systems

High-performance clusters and datacenters pose increasingly demanding re...
research
12/27/2018

Extending TCP for Accelerating Replication on Cluster File Systems over SDNs

This paper explores the changes required of TCP to efficiently support c...
research
03/04/2018

Applied Erasure Coding in Networks and Distributed Storage

The amount of digital data is rapidly growing. There is an increasing us...
research
08/12/2020

The network footprint of replication in popular DBMSs

Database replication is an important component of reliable, disaster tol...
research
01/31/2022

Fragmented ARES: Dynamic Storage for Large Objects

Data availability is one of the most important features in distributed s...
research
03/21/2018

A Robust Fault-Tolerant and Scalable Cluster-wide Deduplication for Shared-Nothing Storage Systems

Deduplication has been largely employed in distributed storage systems t...
