Dependability Analysis of Data Storage Systems in Presence of Soft Errors

12/23/2021
by Mostafa Kishani, et al.

In recent years, the high availability and reliability of Data Storage Systems (DSS) have been significantly threatened by soft errors occurring in storage controllers. Because of their specific functionality and hardware-software stack, error propagation and manifestation in DSS differ markedly from those in general-purpose computing architectures. To our knowledge, no previous study has examined the system-level effects of soft errors on the availability and reliability of data storage systems. In this paper, we first analyze how soft errors occurring in the server processors of storage controllers affect the dependability of the entire storage system. To this end, we implemented the major functions of a typical data storage system controller, running on a full storage system operating system stack, and developed a framework to perform fault injection experiments using a full-system simulator. We then propose a new metric, Storage System Vulnerability Factor (SSVF), to accurately capture the impact of soft errors in storage systems. Our extensive experiments reveal that, depending on the controller configuration, up to 40% of cache memory contains end-user data, and any unrecoverable soft error in this part results in irreversible Data Loss (DL). In contrast, soft errors in the rest of the cache memory, which is filled by the Operating System (OS) and storage applications, result in Data Unavailability (DU) at the storage system level. Our analysis also shows that Detectable Unrecoverable Errors (DUEs) in the cache data field are the major cause of DU in storage systems, while Silent Data Corruptions (SDCs) in the cache tag and data fields are the main cause of DL.
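The region-based distinction drawn in the abstract can be illustrated with a minimal Python sketch. It is not the authors' fault injection framework and it does not compute SSVF, whose definition the abstract does not give. It only encodes the stated mapping from the cache region hit by an unrecoverable soft error to the storage-level outcome, taking the DL-prone part of the cache to be the lines that hold end-user data. The names `Region`, `Outcome`, `classify`, and `tally` are hypothetical and chosen for illustration only.

```python
from collections import Counter
from enum import Enum


class Region(Enum):
    """Hypothetical labels for what a cache line holds when the error hits."""
    USER_DATA = "end-user data"
    OS_OR_APP = "OS / storage application"


class Outcome(Enum):
    """Storage-level outcomes named in the abstract."""
    DATA_LOSS = "DL"
    DATA_UNAVAILABILITY = "DU"


def classify(region: Region) -> Outcome:
    """Map the region hit by an unrecoverable soft error to an outcome.

    Assumption: errors in end-user data lead to DL, and errors in the
    rest of the cache lead to DU, as summarized in the abstract.
    """
    if region is Region.USER_DATA:
        return Outcome.DATA_LOSS
    return Outcome.DATA_UNAVAILABILITY


def tally(regions) -> Counter:
    """Count outcomes over a sequence of injected-error regions."""
    return Counter(classify(region) for region in regions)
```

A call such as `tally([Region.USER_DATA, Region.OS_OR_APP])` counts one DL and one DU event. A real campaign would derive the region from the simulator state at injection time instead of supplying it by hand.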


