Online detection of failures generated by storage simulator

01/18/2021
by Kenenbek Arzymatov, et al.

Modern large-scale data farms consist of hundreds of thousands of storage devices that span distributed infrastructure. Devices used in modern data centers (such as controllers, links, SSDs, and HDDs) can fail due to hardware as well as software problems. Such failures or anomalies can be detected by monitoring the activity of components using machine learning techniques. To use these techniques, researchers need plenty of historical data from devices in both normal and failure modes for training algorithms. In this work, we address two problems: 1) the lack of storage data for the methods above, which we mitigate by creating a simulator, and 2) the application of existing online algorithms that can more quickly detect a failure in one of the components. We created a Go-based package for simulating the behavior of modern storage infrastructure. The software is based on the discrete-event modeling paradigm and captures the structure and dynamics of high-level storage system building blocks. The package's flexible structure allows us to create a model of a real-world storage system with a configurable number of components. The primary area of interest is exploring the storage system's behavior under stress testing or medium- to long-term operation in order to observe failures of its components. To discover failures in the time series generated by the simulator, we modified a change-point detection algorithm that works in online mode. The goal of change-point detection is to discover changes in the distribution of a time series. This work describes an approach for failure detection in time series data based on direct density ratio estimation via binary classifiers.
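As a rough illustration of the last idea, the sketch below scores the dissimilarity between a reference window and a test window of a time series by training a binary classifier to tell them apart. The log-odds of that classifier estimate the log density ratio between the two windows, and the difference of its mean over the two windows gives a symmetric, KL-like change score. Everything in it is an assumption made for illustration and is not taken from the paper: the name `ChangeScore`, the choice of a one-feature logistic regression as the classifier, the fixed step count and learning rate, and the synthetic series in `main`.

```go
package main

import (
	"fmt"
	"math"
)

// sigmoid maps a logit to a probability in (0, 1).
func sigmoid(z float64) float64 { return 1 / (1 + math.Exp(-z)) }

// standardize returns copies of both windows scaled by their pooled mean and
// standard deviation; ok is false when the pooled data is constant.
func standardize(ref, test []float64) (r, t []float64, ok bool) {
	all := append(append([]float64{}, ref...), test...)
	var mean, variance float64
	for _, x := range all {
		mean += x
	}
	mean /= float64(len(all))
	for _, x := range all {
		variance += (x - mean) * (x - mean)
	}
	std := math.Sqrt(variance / float64(len(all)))
	if std == 0 {
		return nil, nil, false
	}
	for i := range all {
		all[i] = (all[i] - mean) / std
	}
	return all[:len(ref)], all[len(ref):], true
}

// ChangeScore (hypothetical name) fits a logistic classifier that separates
// ref (label 0) from test (label 1) and returns the mean logit on test minus
// the mean logit on ref. The logit equals the log density ratio up to a
// constant, and that constant cancels in the difference.
func ChangeScore(ref, test []float64) float64 {
	if len(ref) == 0 || len(test) == 0 {
		return 0
	}
	r, t, ok := standardize(ref, test)
	if !ok {
		return 0
	}
	var w, b float64
	n := float64(len(r) + len(t))
	for step := 0; step < 300; step++ { // plain gradient descent on log-loss
		var gw, gb float64
		for _, x := range r { // label 0: gradient term is p
			p := sigmoid(w*x + b)
			gw += p * x
			gb += p
		}
		for _, x := range t { // label 1: gradient term is p - 1
			p := sigmoid(w*x + b)
			gw += (p - 1) * x
			gb += p - 1
		}
		w -= 0.5 * gw / n
		b -= 0.5 * gb / n
	}
	var mt, mr float64
	for _, x := range t {
		mt += w*x + b
	}
	for _, x := range r {
		mr += w*x + b
	}
	return mt/float64(len(t)) - mr/float64(len(r))
}

func main() {
	// Synthetic series (made up): level 10 for 100 steps, then level 14.
	series := make([]float64, 200)
	for i := range series {
		series[i] = 10 + math.Sin(float64(i))
		if i >= 100 {
			series[i] += 4
		}
	}
	const win = 25
	// Slide two adjacent windows over the series and score each position.
	for end := 2 * win; end <= len(series); end += win {
		ref := series[end-2*win : end-win]
		test := series[end-win : end]
		fmt.Printf("t=%3d score=%.3f\n", end, ChangeScore(ref, test))
	}
}
```

In an online setting, a score like this would be recomputed as new samples arrive and compared against a threshold; the window length and threshold are tuning choices that this sketch does not address.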


