Live Forensics for Distributed Storage Systems

07/24/2019
by Saurabh Jha, et al.

We present Kaleidoscope, an innovative system that supports live forensics for application performance problems caused by either individual component failures or resource contention in large-scale distributed storage systems. The design of Kaleidoscope is driven by our study of I/O failures observed in a peta-scale storage system anonymized as PetaStore. Kaleidoscope is built on three key features: 1) using temporal and spatial differential observability for end-to-end performance monitoring of I/O requests, 2) modeling the health of storage components as a stochastic process using domain-guided functions that account for path redundancy and uncertainty in measurements, and 3) observing differences in reliability and performance metrics between similar types of healthy and unhealthy components to attribute the most likely root causes. We deployed Kaleidoscope on PetaStore, and our evaluation shows that Kaleidoscope can run live forensics at 5-minute intervals and pinpoint the root causes of 95.8% of real-world production issues, with negligible monitoring overhead.
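The idea of spatial differential observability can be illustrated with a minimal sketch. It is not Kaleidoscope's algorithm; it only assumes that each I/O probe records the components on its path and its latency, and it ranks components by how much slower probes are when they pass through a component than when they avoid it. The function name `rank_suspects`, the component names, and the latency values are all hypothetical.

```python
from statistics import mean


def rank_suspects(probes):
    """Rank components by latency contrast.

    probes: list of (components_on_path, latency_ms) tuples (assumed format).
    Returns (component, score) pairs, highest score first, where score is
    mean latency of probes through the component minus mean latency of
    probes that avoid it.
    """
    components = set()
    for path, _ in probes:
        components.update(path)

    scores = {}
    for comp in components:
        through = [lat for path, lat in probes if comp in path]
        around = [lat for path, lat in probes if comp not in path]
        if not through or not around:
            continue  # no contrast available for this component
        scores[comp] = mean(through) - mean(around)

    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical probes: two clients, two switches, two storage servers.
probes = [
    (("client1", "switchA", "server1"), 5.0),
    (("client1", "switchA", "server2"), 50.0),
    (("client2", "switchB", "server1"), 6.0),
    (("client2", "switchB", "server2"), 55.0),
]
ranking = rank_suspects(probes)
print(ranking[0])  # the component with the largest latency contrast
```

In this toy input, every slow probe shares `server2` while the clients and switches each appear on both fast and slow paths, so `server2` gets the largest contrast. A real system would also need to handle measurement noise and redundant paths, which the abstract says Kaleidoscope addresses with a stochastic health model.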


